Scientific leadership should be responsible for data platforms
The CSO, rather than a distinct engineering lead, should drive the deployment and regular use of data platforms within a biotech organization. This is both because the requirements for the platform should be set by the biological goals of the company and because adopting such a platform in practice requires strong scientific leadership to change the daily behavior of researchers.
The current standard is to delegate this responsibility to a computational lead who is further removed from the scientific goals of the organization by training and detached from the daily behavior and needs of the end users. This has led to the widespread deployment of bloated, buggy, and often disjointed platforms that are never truly adopted by research teams and take years and significant resources to stand up. Thus the research productivity that could be gained from a single collaborative environment for accessing and exploring all experimental data over time goes unrealized at many biotech organizations.
These general claims are made from observing the internals of hundreds of biotech companies. CSO is used interchangeably with scientific leadership throughout this essay and describes the person or team that plans drug programs.
How data platforms change research
Modern biotechs are sprawling organizations, composed of teams spanning broad scientific disciplines, often siloed from one another by drug program and working across different states or countries. More than just vehicles for advancing new science, these companies are complex human engineering projects, with technical communication barriers and geographic or project-based silos that pose new challenges to the free flow of information necessary for scientific collaboration.
Biotechs live and die by the speed at which they can draw meaningful conclusions about the biology of their drug candidates and their interaction with disease models. But understanding “how good” a drug candidate is requires synthesizing data, analysis and expert interpretation across these boundaries. The already challenging team orchestration problems involved in building such layered images of drug viability are compounded by the rise of sequencing, proteomics and imaging as standard tools to interrogate new biology. These new methods spit out large streams of data, mostly uninterpretable in their raw form, that need to be processed on large computers and integrated with traditional biochemical assays.
There is clearly a need for some platform to centralize and integrate the different streams of experimental data so that teams across each biotech can reach biological consensus, but it is important to first acknowledge both the simplicity and the power of this goal. We are not looking for some magic machine learning component to extract hidden insight from experimental data that is beyond the rational interpretation of a trained scientist, or some mathematical technique to tell us how to design programs and drugs. The need is for basic software plumbing - a platform that can answer questions such as “given this set of sequencing runs and biochemical experiments on this and that day, what were their processed readouts” and do so in a way that any team, at any time, with any level of computational fluency can ask and understand.
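To make the plumbing concrete, here is a minimal sketch of the kind of query such a platform should make trivial, assuming a hypothetical experiment registry and readout table. The table names, columns, dates and URIs are illustrative only, not a real product's schema or API.

```python
# A minimal sketch, assuming a hypothetical experiment registry and a table of
# processed readouts. All names and values are illustrative.
import pandas as pd

# Hypothetical registry of experiments captured by the platform.
experiments = pd.DataFrame({
    "experiment_id": ["EXP-101", "EXP-102", "EXP-103"],
    "assay": ["bulk RNA-seq", "qPCR", "flow cytometry"],
    "run_date": pd.to_datetime(["2023-05-01", "2023-05-01", "2023-05-03"]),
    "team": ["platform", "molecular biology", "immunology"],
})

# Hypothetical table of processed readouts linked back to each experiment.
readouts = pd.DataFrame({
    "experiment_id": ["EXP-101", "EXP-102", "EXP-103"],
    "readout_uri": [
        "s3://data-platform/rnaseq/EXP-101/deseq2_results.csv",
        "s3://data-platform/qpcr/EXP-102/delta_ct.csv",
        "s3://data-platform/flow/EXP-103/gated_populations.csv",
    ],
})

# "Given the experiments run on these days, what were their processed readouts?"
answer = (
    experiments[experiments["run_date"].between("2023-05-01", "2023-05-03")]
    .merge(readouts, on="experiment_id")
)
print(answer[["experiment_id", "assay", "team", "readout_uri"]])
```

The point is not the specific tables or tooling but that a question like this should be answerable in seconds by anyone at the company, without tracking down the person who ran the experiment.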
A seasoned leader of a biotech organization will immediately appreciate the impact on the speed and quality of R&D from a platform that delivers on this goal. Instant access to data breeds an environment of autonomous and rapid hypothesis generation: scientists no longer have to worry about the “how” of acquiring data, by tracking down the right person or playing email tag with the bioinformatics core, but can focus on and play with the data itself. This leads not only to faster analysis but also to fundamentally new classes of insight, as the ability to retrieve historic experiments and layer different experimental modalities, e.g. looking at the consensus between qPCR, bulk RNA-seq and flow cytometry, creates richer images of the biomolecular systems under study. A single platform also encourages free-flowing collaboration amongst teams as it is a shared and familiar medium for all.
This is an intentionally abstract description, as the needs of each company will differ. In all cases, the software infrastructure and ecosystem of tools that capture and expose experimental data to the entire company is what we will refer to as a “data platform”. It is a window into the sum total of experimental data generated by a biotech, accessible to everyone from bench scientists to computational biologists to the C-suite.
If every biotech had access to a functioning data platform that was used and trusted by scientists, it would usher in a golden age of research productivity across the industry. But the reality is far from this. Most biotechs completely outsource all analysis, with long delays and no ability to question results. Others have attempted to build internally, but often end up with platforms that are never truly used and take years to stand up. There are a handful of well-funded and well-known outliers that have assembled enormous engineering teams and successfully built platforms, but at the cost of millions of dollars and an untold opportunity cost in drugs not built. This is wasteful and is slowing the pace of innovation in the industry.
Why data platforms are largely broken across biotech
Scientific leadership at most biotechs has handed data platform responsibility to engineering leads who are somewhat removed from the economic bottom line of the business and do not immediately understand the biological goals of a platform or the usability needs of scientists.
Engineering is divorced from science
In an industry where the resources to screen just one more library can yield a molecule worth over a billion dollars and be the difference between bankruptcy and cures for disease, every dollar must be allocated in active tradeoff with scientific goals. However, the resources for data infrastructure are often not treated as coming from the same pool as those feeding general R&D. Biotechs create largely autonomous engineering teams that control capital and staffing allocations equivalent to the cost of one or more entire drug programs.
This economic dissociation also leads to slow or misaligned timelines. Platform planning is not done in lockstep with program milestones, so platforms are not ready when they are needed. Ideally a strict set of capabilities, e.g. the types of analysis and scientific questions ranked by the most urgent experiments and biological needs, is prioritized so that functioning systems are online the moment experimental data is streaming off machines. But this discipline is quite difficult without scientific leadership, and the separation described above makes it very rare in practice. Platforms are generally behind schedule, often by a significant amount of time.
Internal tools are difficult to build
Somewhat overlapping with scoping and timelines is the problem of platform bloat. In the absence of the requirement to collect dollars from actual customers, internal engineering teams have to be incredibly diligent in building their products - obsessively collecting requirements, regularly user testing with scientific teams, rigorously prioritizing the backlog and constantly truth seeking. They are, after all, software teams building tools for a group of users (scientists) they are not intimately familiar with (i.e. not themselves), so without these practices they are probably building the wrong thing most of the time. Yet these product management structures are underdeveloped, if not completely absent, within biotechs. Engineers enjoy building things, and building without the aforementioned accountability is every engineer's dream (the author included). Further, many small biotechs simply lack the resources or dedicated team members to begin to build tools. The lack of strict scientific oversight tightly scoping platform requirements has led to feature creep, and therefore time and resource waste.
Why does this happen?
The most probable reason is that many leaders do not even recognize that this is a problem. They see other companies around them with data platforms that are just as broken or worse. They do not place the same expectations and urgency on a functioning platform as they do on core scientific functions.
Another reason is the structure of a biotech itself. These are unique and challenging businesses to build. They are in many ways more insulated from market forces than other types of companies, with longer expected time to revenue and very heavy R&D spend. This makes it easier for biotech leadership to overspend on and mismanage projects like data platforms than it is at other businesses. It is also more difficult to diagnose the poor execution of such a project, whether through frivolous spend or, more subtly, through the lack of working systems needed to truly understand experiments, as a contributing and perhaps causal factor in company restructuring or death.
There is also an element of technical pride in building internal computer systems, pressure from venture investors to show differentiation and an ambition to follow in the footsteps of larger, respected biotech companies. The author has heard many biotech founders describe similar dreams of becoming the next Recursion that drive the decision to build.
Scientific ownership leads to useful and economic data platform buildout
Engineering reports to science
The natural way to address these problems is to create a transparent reporting structure between science and engineering. Scientific leadership sets timelines for the data platform dictated by drug program needs, defines requirements for platform capabilities based on the company's most urgent biological goals and ensures that scientists are trained on the platform and use it regularly. It does not concern itself with the tools, languages, vendors or any other technology engineers choose to use.
This creates a healthy separation of product definition from engineering that allows for greater focus, speed and accountability. Scientific leadership does not need domain knowledge in computer science or engineering to define the biological goals, usability requirements and timelines of the platform. In fact, they are the best equipped to do so, contrary to current practice.
Platform requirements are defined by scientific goals
Scientific leadership should define the requirements for their data platforms by looking at the anatomy of their drug programs and identifying each point where experimental data is used to answer core biological questions. The best way to do this is to draw a graph on a piece of paper that represents the program - draw circles for each experiment and each team, and draw edges between the circles that depend on each other for information.
For each edge, usually a team-team or team-experiment relationship, the CSO should put themselves in the shoes of the scientist and think about what they care about. For instance, a bulk RNA-seq to molecular biology team edge would require identifying differentially expressed genes and relating them to well known pathways and functions. This is a data platform requirement. A platform team to immunology team edge would require some shared medium for handing off and communicating about sequencing and flow cytometry data - the team synthesizing and screening libraries needs to compare notes and hypothesize with the team validating leads with standard flow assays. This is a data platform requirement.
The goals and timelines are now clear. Scientists need to be able to accomplish these concrete things at the time when the experiments are ready. These are the requirements the CSO hands to the engineers.
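The paper exercise translates directly into a handoff artifact. Below is a minimal sketch of the program graph as a plain data structure, where each edge carries the requirement it implies; the teams, experiments and requirement text are hypothetical examples drawn from the scenario above, not a prescribed schema.

```python
# A minimal sketch of the program graph described above. Nodes are experiments
# and teams; each edge records the platform requirement it implies. All entries
# are illustrative examples, not a prescribed format.
program_graph = {
    "nodes": [
        {"name": "bulk RNA-seq", "kind": "experiment"},
        {"name": "flow cytometry", "kind": "experiment"},
        {"name": "platform team", "kind": "team"},
        {"name": "molecular biology team", "kind": "team"},
        {"name": "immunology team", "kind": "team"},
    ],
    "edges": [
        {
            "from": "bulk RNA-seq",
            "to": "molecular biology team",
            "requirement": "identify differentially expressed genes and relate them to known pathways",
        },
        {
            "from": "platform team",
            "to": "immunology team",
            "requirement": "shared medium for handing off sequencing and flow cytometry data",
        },
    ],
}

# Each edge becomes one requirement the CSO hands to engineering.
for edge in program_graph["edges"]:
    print(f'{edge["from"]} -> {edge["to"]}: {edge["requirement"]}')
```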
User testing with scientists is a priority
The CSO should make sure the engineering team is regularly user testing with and spending time alongside the research team - shadowing their work at the bench, understanding what they care about and, overall, building empathy for experimental biology and the scientists who practice it. This is the only way the platform will be truly useful for experimentalists.
Scientific leadership drives scientists to adopt new tools
Scientists are especially conservative in adopting new tools and averse to change, partially because they have learned that restricting sources of variability in behavior often leads to more reproducible research. They need strong scientific leaders they trust and respect to guide the adoption and regular use of a new data platform. Existing computational leads are rarely in a position to actually influence the behavior of researchers.
The utility of new tools with organization-scale effects on productivity is best appreciated by company leaders. They have the economic goals and the aerial perspective of the company needed to recognize value that is harder to see in the short term or within the context of a single team.
Shared budgeting with R&D leads to rational spending decisions and timelines
A CSO who budgets and plans timelines for drug programs and data platforms together will likely make more economical decisions for a biotech.
Resources allocated toward building platform infrastructure - including decisions around hiring internal teams of engineers - will be weighed in direct tradeoff with new experiments and research initiatives that have a higher chance of increasing program viability or even constructing new programs.
The timelines to stand up the platform and train researchers to use it will be defined by the actual scientific needs of program milestones, e.g. bulk RNA-seq analysis needs to be operational and usable the moment sequencing data is ready. This will tighten timelines but also cut the scope of the biological analyses and questions the platform should be capable of answering.
Acknowledgements
Thank you to Dillon Flood (ElsieBio), Jackson Brougher (Doloromics), Brian Naughton (Hexagon Bio), Kyle Giffin, Tahir DMello, Bronte Kolar (LatchBio) for their thoughtful comments and review.