The Latch SDK

the open-source toolchain to solve software in biology // dynamically compile type safe UIs from raw python // serverless pipeline execution // automatic containerization and versioning

Feb 16, 2022

Software in biology could use some help. Much has been written on this topic, and Elliot Hershberg articulates it well:

This organizational structure for developing [biological] methods and software has resulted in a tsunami of unusable tools …scientists new to programming throw up their hands in confusion, and seasoned programmers tear their hair out with frustration.

In our experience building infrastructure for biotech startups and academic labs, we’ve identified three major problems that plague software in biology:

Un-maintained, un-documented, (often) un-runnable academic codebases :(
A lack of computational literacy amongst the end users, the biologists, who often struggle to wield such fragile tools.
Increasingly complex and expensive infrastructure - often TBs of file space and high-performance and/or GPU-enabled computing instances are required for modern -omics analysis.

This is the unholy trinity of organic software. It has halted our progress in fundamental biology in industry and academia alike.

Today we are releasing a tool to alleviate all three problems simultaneously - the Latch SDK - an open-source development kit to:

easily construct versioned, reproducible bioinformatics workflows in any language
dynamically compile type safe, no-code interfaces to execute workflows
scale workflows to cloud infrastructure sporting TBs of file space and arbitrary computing resources.

This is step 0 of the Latch project, towards our goal to build and disseminate the data infrastructure of the biocomputing revolution and our pledge to grow this platform to unleash unrealized productivity in the life sciences.

Latch SDK Features

Let us first summarize the main features of the Latch SDK at a high level before diving into implementation decisions. For further information on usage, see our docs page.

Generate typed interfaces from Python

The basic principle of the Latch SDK is that typed function headers can be easily compiled into interfaces. Each variable within the header maps to a distinct HTML input component. The type annotation applied to the variable informs both the style of the component and the validation rules applied to its input.
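As a rough illustration of this principle (not the SDK's actual implementation), the sketch below reads a function's type annotations and emits one form-field description per parameter. The `COMPONENTS` table, the `interface_for` helper, and the `trim_reads` function are all hypothetical names chosen for the example.

```python
import inspect
from typing import get_type_hints

# Hypothetical mapping from Python types to HTML input descriptions.
COMPONENTS = {
    int: {"tag": "input", "type": "number", "step": "1"},
    float: {"tag": "input", "type": "number", "step": "any"},
    bool: {"tag": "input", "type": "checkbox"},
    str: {"tag": "input", "type": "text"},
}


def interface_for(fn):
    """Return one form-field description per parameter of fn."""
    hints = get_type_hints(fn)
    fields = []
    for name in inspect.signature(fn).parameters:
        component = dict(COMPONENTS[hints[name]])  # copy so the table stays unchanged
        component["name"] = name
        fields.append(component)
    return fields


# Hypothetical task used only to exercise the sketch.
def trim_reads(min_quality: int, error_rate: float, paired_end: bool, adapter: str):
    ...


fields = interface_for(trim_reads)
for field in fields:
    print(field)
```

The type annotation drives both the choice of component (checkbox versus number input) and the validation attributes (`step`), which mirrors the two roles described above.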

In the following image, we can clearly see an example of this translation, as each function parameter assumes an interactive counterpart embedded in a web application.

This dynamic generation of fully featured UIs from Python, a language widely understood and ubiquitous in the bioinformatics community, will allow no-code access to bioinformatics tools at scale. Previously unusable research repositories can be uploaded, published, and shared as Latch interfaces, and bespoke scripts can be transformed into executable links and passed around academic circles as easily as social media posts. We hope that Latch interfaces become a drop-in replacement for bare code repositories in the research literature, where they will be far more interactive and remain stable decades after publication.

Customize interface layout and component style

To complement dynamic frontend generation, we have developed a domain language that allows fine-grained control of the interface layout and component styling. Developers can inject simple rules as plain YAML in their function docstrings to control how their interface is presented to the end user.

Our domain language allows the following modes of control over interface layout:

Parameter ordering
Parameter display names and details
Semantic parameter grouping + subheaders
“Hidden” parameters (collapsible and out of sight by default)
Customizable tooltip messages
Default values and suggestions

Additionally, the style of a specific parameter can be customized. For example, if a biologist needs to identify common CRISPR-Cas system cut sites for a gene editing tool, a generic number input can be replaced with radio options corresponding to semantic choices, e.g. -10, -3, +1. Both component presentations correspond to an “Integer” type, but the right appearance can greatly enhance the user experience and ground the input in biological context.
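To make the cut-site example concrete, here is a minimal sketch of how one integer type could yield two different presentations. It attaches the style hint through `typing.Annotated` rather than through the YAML rules described above, and the `Choices` marker and `component_for` helper are assumptions made for illustration, not part of the Latch SDK.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin


@dataclass(frozen=True)
class Choices:
    """Hypothetical style hint: render the parameter as radio options."""
    values: tuple


# An integer restricted to the cut sites mentioned in the text.
CutSite = Annotated[int, Choices((-10, -3, 1))]


def component_for(tp):
    """Pick a presentation for a (possibly annotated) parameter type."""
    if get_origin(tp) is Annotated:
        base, *extras = get_args(tp)
        for extra in extras:
            if isinstance(extra, Choices):
                return {"widget": "radio", "base": base.__name__, "options": list(extra.values)}
        tp = base  # annotated, but without a style hint we recognise
    return {"widget": "number", "base": tp.__name__}


print(component_for(int))
print(component_for(CutSite))
```

Both calls report `int` as the underlying type; only the widget differs.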

Automatic Versioning and Containerization

Each time a workflow is registered to the Latch platform, the Latch SDK containerizes and versions the code in the background. This built container then becomes the execution environment of the code when invoked through the interface.

Containerization can be either seamless or highly customized. By default, container images are constructed by parsing the Python runtime for imported libraries and generating best-effort pip install x directives; alternatively, they can be defined with custom Dockerfiles. Similarly, versions are user-specified as any unique plaintext string.

This behavior is a strict requirement of the toolchain and gives us remarkable guarantees with respect to code reproducibility, portability and scalability. Containerized workflow logic can then be both duplicated en masse and run on heterogeneous computing environments. Additionally, each and every change is given a version-locked container for painless rollback.

The enforced containerization and versioning, and the ease with which they occur, allow bioinformaticians to benefit from software engineering best practices without worrying about implementation details. The crisis of reproducibility, largely caused by poorly documented dependencies + software build requirements, can thus be heavily alleviated by these invariants enforced by the Latch SDK.

Hardware Specification

An enormous challenge in modern bioinformatics is deploying workflow logic on high-performance computing instances capable of handling genome-scale assembly or simulated protein folding. Though cloud platforms have facilitated access to such computers, considerable Linux administration expertise (e.g. file system mounting, dependency wrangling, driver installation) can still bar deployment and use of these powerful bio-tools; just ask anyone who has to maintain a CUDA/cuDNN cluster.

The Latch SDK allows deployment of the containerized workflow code described previously on enormous GPU-enabled or multi-threaded HPCs with a single Python decorator.
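The post does not show the decorator itself, so the following is only a sketch of the general pattern: a decorator that records a hardware request on a function so that a scheduler could read it later. The `resources` decorator, its `cpus` and `gpus` parameters, and `fold_protein` are hypothetical and should not be read as the Latch SDK API.

```python
import functools


def resources(*, cpus: int = 1, gpus: int = 0):
    """Hypothetical decorator that records a hardware request on a task function."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        # Metadata a platform could inspect when choosing a machine.
        wrapper.resources = {"cpus": cpus, "gpus": gpus}
        return wrapper
    return decorate


@resources(cpus=8, gpus=1)
def fold_protein(sequence: str) -> int:
    return len(sequence)  # placeholder body for the sketch


print(fold_protein.resources)
```

The function body stays ordinary Python; the hardware request travels alongside it as data.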

As exposed resource specifications become more fine-grained, and the scale + diversity of our instances swell with use, the platform will offer a true serverless computing experience (bringing the far-reaching vision of our team’s alma mater to fruition ;). Higher-fidelity hardware manipulation, e.g. declaring arbitrary core and RAM requirements, is one of the first SDK improvements on our backlog.

Implementation Notes

A Rich Biological Type System

A well functioning static type system both guides the developer towards correct code, providing clear interfaces, and shows a user the exact well-defined behavior of a program, catching errors quickly and providing suggestions. It is therefore crucial that a type system correctly models the data flowing through a program to enable a developer to express complex features elegantly.

We at Latch are in the early innings of expanding canonical programming types to accommodate organic primitives.

With the early SDK, we have shipped fine-grained biological types using a pluggable rule-based data validation framework that is both simple and highly extensible. This validation sits on top of the native HTML form validation described earlier.

We simply construct regular expression patterns paired with failure mode messages as a JSON object and stuff the resulting structure into an annotated Python type literal.

import typing

Nuc = typing.Annotated[
    str,
    LatchMetadata(
        {
            "rules": {
                # accept only nucleotide letters, in either case
                "regex": "^[atcgATCG]+$",
                "message": "sequence must contain only the characters atcgATCG",
            }
        }
    ),
]

This annotated type literal can now be used freely within the Latch SDK to construct a new workflow interface.

@latch.task
def beat_covid(
    nucleotide_sequence: Nuc
) -> Cure:
    ...

And the compiled UI will use the regular expression to perform client-side validation with custom error presentation.

This is quite simple, and it extends easily to most organic data representations and biological file types.

FastQFile = typing.Annotated[
    LatchFile,
    LatchMetadata(
        {
            "rules": {
                "regex": r"(\.fastq\.gz|\.fastq)$",
                "message": "Only .fastq or .fastq.gz extensions are valid"
            }
        }
    ),
]
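For readers curious how rules like these can be recovered from an annotated type, here is a minimal sketch that pulls the metadata back out with `typing.get_type_hints(..., include_extras=True)` and applies the regular expression. In the post the validation happens client-side in the compiled UI; this sketch performs the same check in Python purely for illustration. The `Rules` class is a stand-in for the metadata wrapper, and `validate` and `count_gc` are hypothetical names.

```python
import re
from typing import Annotated, get_args, get_type_hints


class Rules:
    """Stand-in for the metadata wrapper in the post; holds a dict of rules."""

    def __init__(self, data: dict):
        self.data = data


Nuc = Annotated[
    str,
    Rules({"rules": {"regex": "^[atcgATCG]+$", "message": "sequence must contain only a, t, c, g"}}),
]


def validate(fn, **values):
    """Return {parameter: message} for every value that fails its regex rule."""
    errors = {}
    hints = get_type_hints(fn, include_extras=True)  # keep the Annotated metadata
    for name, value in values.items():
        for extra in get_args(hints[name])[1:]:  # skip the base type at index 0
            if isinstance(extra, Rules):
                rule = extra.data["rules"]
                if re.search(rule["regex"], value) is None:
                    errors[name] = rule["message"]
    return errors


def count_gc(nucleotide_sequence: Nuc) -> int:
    return sum(base in "gcGC" for base in nucleotide_sequence)


print(validate(count_gc, nucleotide_sequence="ATTGCA"))
print(validate(count_gc, nucleotide_sequence="ATTXCA"))
```

Because the rule is plain data attached to the type, adding a new biological type means writing a new pattern and message rather than new validation code.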

First-class Types Supported

For the sake of rigor, here is a list of the non-rule-based types supported by the Latch SDK.

The following familiar primitives:

None
Integer
Float
Boolean

And initial generics:

List[T]
Union[T, U]

With more coming soon…

Workflow Orchestration - our bet on Flyte

When looking for an orchestration engine powerful enough to become the backbone of our biocomputing stack, we searched far and wide for specific features we thought were mission critical:

first-class static typing
scheduling and resource allocation on a per-task (graph node) basis
a Kubernetes-native execution engine

Flyte was the only framework we found that satisfied all three constraints. The codebase is certainly in its nascent days, with much work ahead for its core team to transform the Lyft-incubated project into an out-of-the-box solution. Assembly was required, and documentation, though improving, is still sparse. However, we believe Flyte has the model of workflow execution down to a tee. The representation of a task as a Kubernetes pod is the most flexible and intuitive way to construct workflows at scale.

Over the first several months of use, our team forked and extensively modified Flyte's core services for our production stack, primarily to expand the type system, build support for our interface layout domain language described earlier, and enable integrations with our managed data infrastructure.

However, we are working with the Flyte team to contribute our modifications directly to the project source. So far this includes building type annotation support and adding generic union types, with more pull requests on the way to achieve parity with our production system.

We believe Flyte will earn its place among the ranks of Docker and Kubernetes as a foundational and ubiquitous component of future infrastructure stacks, and we will continue to support and work with the talented Flyte team to improve the project.

We'll close with the technical advantages of Flyte and its synergy with Latch.

The Importance of First-class Static Typing

Most workflow orchestration frameworks are bad because they are not type safe.

As codebases grow in size and engineering teams swell in ranks, world-class organizations have almost universally switched to verified type hinting or statically typed languages to tame complexity. Type checking can catch many bugs and make code more legible, easing long-term maintenance and comprehension.

Similarly, while early workflow orchestration frameworks such as Nextflow and WDL might be more conducive to quick prototyping and mockups, their lack of type verification will cripple reliability and execution at scale. This issue is exacerbated if the end user is a wet lab biologist, not a Stripe developer, attempting to wrangle complex biological values into workflow inputs.

Type checking also increases the robustness of workflow code itself, just as the interaction between a developer and a compiler improves code quality in traditional programming.

Flyte is the only modern orchestration framework with first-class static typing. It builds on a language-agnostic type system constructed in raw protobuf.

"variables": {
  "strandedness": {
    "type": {
      "collectionType": {
        "enumType": {
          "values": [
            "reverse",
            "forward"
          ]
        }
      }
    }
  },
  "sample_ids": {
    "type": {
      "collectionType": {
        "simple": "STRING"
      }
    }
  },
  "samples": {
    "type": {
      "collectionType": {
        "collectionType": {
          "blob": {}
        }
      }
    }
  }
}

(JSON-serialized protobuf messages representing a group of biological parameters)

This type system can be defined and mapped from the native type systems of languages like Python. Notice how the serialized parameter schema represented above can be derived from the Python block below.

from typing import List
from enum import Enum
from latch import task
from latch.types import LatchFile


class Strandedness(Enum):
    reverse = "reverse"
    forward = "forward"

@task
def nf_rnaseq_tsk(
    strandedness: List[Strandedness],
    sample_ids: List[str],
    samples: List[List[LatchFile]],
):
    ...

(A workflow snippet defined with the Latch SDK that generated the protobuf above)
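The derivation from annotations to schema can be sketched in a few lines. The code below is not Flyte's or Latch's serializer; it is a hypothetical `to_schema` function that handles only the four cases appearing in the example (lists, enums, strings, files), with a local `LatchFile` stand-in so that it runs without the SDK installed.

```python
import enum
import json
from typing import List, get_args, get_origin, get_type_hints


class LatchFile:
    """Stand-in for latch.types.LatchFile so the sketch runs without the SDK."""


class Strandedness(enum.Enum):
    reverse = "reverse"
    forward = "forward"


def to_schema(tp) -> dict:
    """Map a Python type to the JSON-style type description shown above."""
    if get_origin(tp) is list:
        (item,) = get_args(tp)
        return {"collectionType": to_schema(item)}  # recurse for nested lists
    if isinstance(tp, type) and issubclass(tp, enum.Enum):
        return {"enumType": {"values": [member.value for member in tp]}}
    if tp is str:
        return {"simple": "STRING"}
    if tp is LatchFile:
        return {"blob": {}}
    raise TypeError(f"unsupported type: {tp!r}")


def variables_for(fn) -> dict:
    """Build the per-parameter schema for a function's annotated arguments."""
    hints = get_type_hints(fn)
    hints.pop("return", None)
    return {name: {"type": to_schema(tp)} for name, tp in hints.items()}


def nf_rnaseq_tsk(
    strandedness: List[Strandedness],
    sample_ids: List[str],
    samples: List[List[LatchFile]],
):
    ...


variables = variables_for(nf_rnaseq_tsk)
print(json.dumps({"variables": variables}, indent=2))
```

The recursion in the list case is what turns `List[List[LatchFile]]` into two nested `collectionType` entries around a `blob`.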

These statically typed and language-agnostic parameter schemas generated from the Latch SDK are fed into a parser written in pure TypeScript on the client side, which then builds the interface representation of the workflow dynamically.

K8s-native and Task Independent Scheduling

A Kubernetes-native workflow engine gives developers the illusion of infinite compute while also culling underutilized cloud instances. By maintaining a sea of heterogeneous node groups, the kube-scheduler can be leveraged to locate or provision compute for task execution with arbitrary resource requirements.

Furthermore, by deploying the task as a Kubernetes pod, automatic redeployment and replication via the battle-tested controller pattern gives task execution high reliability even if its scheduled node completely fails.

Independent scheduling at per-task granularity also allows for creative orchestration schemes that are useful in biological workflows. For example, sequence alignment on a highly threaded machine followed by machine learning inference on the resulting assembly using a distinct, GPU-enabled machine is both possible and highly practical under this paradigm.
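A toy model helps show why per-task requests matter. The sketch below is not the kube-scheduler; it simply picks the cheapest machine class that satisfies each task's request. The node group names, sizes, and prices are invented for illustration, as are the `place` helper and the two task names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    name: str
    cpus: int
    gpus: int
    hourly_cost: float


# Hypothetical node groups; names, sizes, and prices are illustrative only.
NODE_GROUPS = [
    NodeGroup("general", cpus=4, gpus=0, hourly_cost=0.2),
    NodeGroup("high-cpu", cpus=64, gpus=0, hourly_cost=2.5),
    NodeGroup("gpu", cpus=8, gpus=1, hourly_cost=3.0),
]


def place(cpus: int, gpus: int = 0) -> str:
    """Return the cheapest node group that satisfies a task's request."""
    fitting = [g for g in NODE_GROUPS if g.cpus >= cpus and g.gpus >= gpus]
    if not fitting:
        raise ValueError("no node group satisfies the request")
    return min(fitting, key=lambda g: g.hourly_cost).name


# Each task in the pipeline carries its own resource request.
pipeline = [
    ("align_reads", {"cpus": 48}),
    ("predict_variants", {"cpus": 4, "gpus": 1}),
]
placements = {task: place(**request) for task, request in pipeline}
print(placements)
```

Under these assumptions the alignment step lands on the highly threaded group and the inference step on the GPU group, so neither task pays for hardware it does not use.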

We want you to get your hands on the Latch SDK to build your own cloud-native bioinformatics applications in a few lines of Python.

Onboarding from the early-access waitlist begins today - sign up to begin taming the flow of nucleotides with silicon and join us in the construction of the biocomputing infrastructure of tomorrow.

More information here

LatchBio
