Engineering Plastic-Degrading Enzymes and PCSK9 Binders with Protein AI Tools
A protein engineering toolkit for molecular design // Engineering enzymes to break down plastic with ProtGPT2, OmegaFold, TemStaPro // Designing drugs with RFDiffusion, ProteinMPNN, ColabFold
In the past year, a new class of machine learning tools for protein engineering has promised to change how we develop drugs and enzymes. However, in these early days, the industry is still working out how to apply these tools to concrete problems.
After working with cutting-edge academic labs and computational biotech teams on real problems in research and industry, we’ve assembled a curated toolbox of graphically accessible protein engineering tools and demonstrate how to string them together for useful molecular design tasks.
Here we build two different proteins - plastic-degrading enzymes and blood cholesterol drugs - each with distinct scientific goals to show the flexibility of the toolkit and breadth of utility.
To build enzymes we:
1. Finetune ProtGPT2 on enzyme sequences
2. Use finetuned ProtGPT2 to generate a library of novel enzymes
3. Create 3D protein structures with OmegaFold
4. Predict thermal stability with TemStaPro
5. Evaluate aggregation propensity with Aggrescan3D
6. Calculate electrostatic potential and hydrophobicity with PEP-Patch
To build PCSK9 binders we:
1. Use PyMOL to identify the region of PCSK9 we want to bind
2. Diffuse 10 scaffolds near binding hotspots on PCSK9 with RFDiffusion
3. Use ProteinMPNN to generate 100 sequences for these scaffolds
4. Predict 3D protein structure with ColabFold
We trace the actual end-to-end engineering flows from ideation to structure and use believable scientific details to contextualize the problems in familiar biology. We hope this post serves as a resource for scientists and engineers navigating this new ecosystem of models.
Engineering Plastic Degrading Enzymes
Global production of plastics has increased from 1.3 million tons in 1950 to 359 million tons in 2018. [1] However, most plastics do not break down naturally in a significant way, often persisting for hundreds of years. [2] Because of this accumulation of plastics and microplastics in our environment, food and bodies, there is widespread interest in developing methods for clean, stable and scalable enzymatic degradation of plastics.
In this first flow, we generate a library of candidate enzymes that might be able to degrade plastics. We not only predict structures and sequences but also attempt to predict protein thermal stability, electrostatic potential, hydrophobicity and aggregation propensities for each enzyme generated.
Flow Anatomy
The steps are as follows:
Finetune ProtGPT2 on a functionally diverse but well curated set of plastic-degrading enzyme sequences
Use finetuned ProtGPT2 to generate a library of novel enzymes
Create 3D protein structures for the library of enzymes with OmegaFold
Predict thermal stability of enzymes with TemStaPro
Evaluate aggregation propensity with Aggrescan3D
Calculate electrostatic potential and hydrophobicity of enzymes with PEP-Patch
Organize and filter results
1. Model Fine-tuning
ProtGPT2 is a decoder-only model built on the GPT-2 Transformer architecture, with 36 layers, a model dimensionality of 1280, and 738 million parameters. It was pre-trained on UniRef50, a clustered protein sequence database, with a causal language modeling objective: it learns to predict the next token (a short oligomer of amino acids) in a protein sequence. This training lets ProtGPT2 capture the statistical regularities of natural protein sequences, in effect enabling it to "speak" the protein language. [3]
The model has demonstrated the ability to generate sequences that retain critical features of natural proteins—such as amino acid propensities, secondary structure, and globularity—while also exploring previously unseen regions of the protein space. This makes ProtGPT2 an ideal tool for generating diverse and potentially innovative enzyme sequences.
Choosing the right data for fine-tuning was essential. To generate novel enzymes, we needed a diverse but functional dataset of enzymes known to degrade various types of plastics. We compiled a list of enzymes using BRENDA, a comprehensive online database of functional, biochemical and molecular biological data on enzymes, metabolites and metabolic pathways. [4] Among others, this includes:
Poly(ethylene terephthalate) hydrolase [EC 3.1.1.101] - PETase enzyme that specifically targets Polyethylene terephthalate (PET) plastic.
Mono(ethylene terephthalate) hydrolase [EC 3.1.1.102] - MHETase enzyme, working alongside PETase to further break down PET.
Alkane 1-monooxygenase [EC 1.14.15.3] - Enzyme involved in the degradation of polyethylene (PE).
We then retrieved sequences for all proteins classified under these enzyme categories from UniProtKB, [5] resulting in a FASTA file with 32,434 sequences for fine-tuning. The fine-tuning was conducted using default parameters on eight V100 GPUs via a Latch Workflow designed to support both zero-shot and fine-tuned predictions with ProtGPT2.
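As an illustration of the data preparation step, here is a minimal sketch of turning a FASTA file into a plain-text training file. We have not described the exact preprocessing used in the Latch Workflow, so treat this as an assumption: it follows the convention suggested on the public ProtGPT2 model card, where each FASTA header is replaced with the `<|endoftext|>` token and sequences are wrapped at 60 characters per line. The function names are hypothetical.

```python
from typing import Iterator, Tuple

# Separator token suggested on the ProtGPT2 model card (assumption: the
# fine-tuning workflow used in this post may preprocess data differently).
END_OF_TEXT = "<|endoftext|>"


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_training_text(fasta_text: str, width: int = 60) -> str:
    """Replace each FASTA header with the separator token and re-wrap sequences."""
    records = []
    for _, seq in read_fasta(fasta_text):
        wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        records.append(f"{END_OF_TEXT}\n{wrapped}")
    return "\n".join(records) + "\n"
```

The resulting text file can then be split into training and validation sets before being handed to whichever fine-tuning script you use.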
2. Sequence Generation
Following the fine-tuning, the model was used to generate an initial library of 1,000 sequences, each capped at 1,000 amino acids.
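For readers who want to reproduce this step outside of Latch, the sketch below shows one way to sample from a ProtGPT2 checkpoint with the Hugging Face `transformers` text-generation pipeline and clean up the output. The sampling values (`top_k=950`, `repetition_penalty=1.2`) are the ones suggested on the public ProtGPT2 model card, not necessarily those used in our workflow. `max_length` is counted in tokens rather than residues, so the 1,000 amino acid cap is applied afterwards by truncation. `model_dir`, the batch size, and the helper names are hypothetical.

```python
from typing import List

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def clean_generated(text: str, max_len: int = 1000) -> str:
    """Strip separator tokens and line breaks, then cap at max_len residues."""
    seq = text.replace("<|endoftext|>", "").replace("\n", "").strip().upper()
    return seq[:max_len]


def is_valid(seq: str) -> bool:
    """True if the sequence is non-empty and uses only standard residues."""
    return bool(seq) and set(seq) <= VALID_AA


def generate_library(model_dir: str, n: int = 1000, batch: int = 50,
                     max_rounds: int = 100) -> List[str]:
    """Sample up to n sequences from a (fine-tuned) ProtGPT2 checkpoint."""
    # Imported lazily so the helpers above work without transformers/torch.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_dir)
    library: List[str] = []
    for _ in range(max_rounds):
        if len(library) >= n:
            break
        outputs = generator(
            "<|endoftext|>",
            max_length=250,           # tokens; each token covers several residues
            do_sample=True,
            top_k=950,                # model-card suggestion (assumption)
            repetition_penalty=1.2,   # model-card suggestion (assumption)
            num_return_sequences=batch,
            eos_token_id=0,
        )
        for out in outputs:
            seq = clean_generated(out["generated_text"])
            if is_valid(seq):
                library.append(seq)
    return library[:n]
```

Calling `generate_library("path/to/finetuned-checkpoint")` would return a list of cleaned sequences that can be written to FASTA for the structure prediction step.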
3. Structure Prediction with OmegaFold
Most deep learning based protein structure prediction tools rely on evolutionary information from multiple sequence alignments (MSAs) to predict structures accurately. However, for enzymes designed to function outside of their native organisms, such as those aimed at degrading plastics in environmental or industrial settings, MSAs may not always be available or reliable due to a lack of evolutionary data and context. The same can hold for newly generated sequences, which may have few close natural homologs.
OmegaFold addresses this challenge by accurately predicting high-resolution protein structures using only a single primary sequence, without the need for MSAs. It combines a protein language model, which makes predictions from single sequences, with a geometry-inspired transformer model trained on known protein structures. This approach allows OmegaFold to achieve similar prediction accuracy to AlphaFold2 and outperform RoseTTAFold, particularly in cases where evolutionary information is limited or noisy. [6]
For this flow, a Latch Workflow for OmegaFold was used to predict the 3D structures of the 1,000 enzyme sequences generated from the fine-tuned ProtGPT2 model.
This process converted the sequences into PDB files, providing initial structural models for further analysis.
While this method is a valuable step in evaluating the potential of these enzymes, it is important to recognize its limitations: the accuracy of these structural predictions, particularly for entirely novel sequences, is not guaranteed. Further computational validation, experimental testing and refinement are therefore necessary.
To explore the potential of the generated enzyme sequences, we then used a range of computational tools to predict key properties and stability of protein sequences.
4. Thermostability Prediction with TemStaPro
We used a Latch Workflow for TemStaPro in bulk to predict the thermal stability of the generated enzyme sequences, estimating their stability across a range of temperature thresholds (40°C to 80°C).
The tool generated detailed TSV files with binary and raw predictions for each enzyme, providing insights into which enzymes may remain stable under conditions relevant to industrial and environmental applications.
Predictions were made for mean thermal stability across the entire sequence, as well as per-residue predictions to pinpoint specific areas of instability within each enzyme. [7]
In future iterations of this process, these insights can guide sequence modifications to improve thermostability, either through rational design or further fine-tuning of ProtGPT2.
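To make the per-residue analysis concrete, here is a small sketch of how one might summarize per-residue scores for a single temperature threshold and locate the least stable stretch of a sequence. It assumes the raw per-residue predictions have already been loaded from the TSV output into a list of floats between 0 and 1; the TSV column layout is not covered here, and the 0.5 cutoff and window size are illustrative assumptions rather than TemStaPro defaults.

```python
from typing import List, Tuple


def mean_score(scores: List[float]) -> float:
    """Mean of the per-residue raw scores for one temperature threshold."""
    return sum(scores) / len(scores)


def binary_call(mean: float, cutoff: float = 0.5) -> bool:
    """Turn a mean raw score into a stable / not-stable call (cutoff is an assumption)."""
    return mean >= cutoff


def least_stable_window(scores: List[float], window: int = 5) -> Tuple[int, float]:
    """Return (start_index, mean) of the contiguous window with the lowest mean score."""
    if not 0 < window <= len(scores):
        raise ValueError("window must be between 1 and len(scores)")
    running = sum(scores[:window])
    best_start, best_sum = 0, running
    for start in range(1, len(scores) - window + 1):
        # Slide the window one residue to the right.
        running += scores[start + window - 1] - scores[start - 1]
        if running < best_sum:
            best_start, best_sum = start, running
    return best_start, best_sum / window
```

The window returned by `least_stable_window` is a natural place to start when proposing stabilizing mutations.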
5. Aggregation Propensity with Aggrescan3D
To evaluate the aggregation tendencies of the 3D structures generated by OmegaFold, Aggrescan3D was used in bulk on all the PDB files through a Latch Workflow. [8]
This tool estimates aggregation propensity by analyzing residue-level interactions and the geometry of the protein surface, information that sequence-only methods often miss. The output provides an A3D score for each residue. As a quick summary measure of the aggregation propensity of each generated enzyme, we used the median of the protein’s residue A3D scores.
This information can highlight potential aggregation issues, guiding further structural optimization.
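The median summary described above can be sketched in a few lines. The example assumes the per-residue scores are available as CSV text with `protein` and `score` columns; those column names are an assumption about the output layout, so adjust them to match the files you actually get.

```python
import csv
import io
import statistics
from typing import Dict, List


def median_a3d_by_protein(csv_text: str) -> Dict[str, float]:
    """Group per-residue A3D scores by protein and return the median for each.

    Assumes columns named 'protein' and 'score' (hypothetical layout).
    """
    scores: Dict[str, List[float]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores.setdefault(row["protein"], []).append(float(row["score"]))
    return {name: statistics.median(vals) for name, vals in scores.items()}
```

We chose the median over the mean because a handful of strongly aggregation-prone residues would otherwise dominate the summary; both are reasonable, and the per-residue scores remain available for closer inspection.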
6. Electrostatic Potential and Hydrophobicity with PEP-Patch
Understanding the electrostatic and hydrophobic properties of enzymes is crucial for predicting their interactions with substrates and behavior in different environments.
PEP-Patch on Latch generates outputs to visualize and quantify these properties on the surface of the predicted enzyme structures. These insights help identify and prioritize surface features that are most likely to support efficient substrate interaction. [9]
Future modifications can focus on enhancing these properties to increase the likelihood of effective plastic degradation in various environments.
7. Organize and Filter Results
To bring all the findings together, we used a Latch Pod to compile the results into a table in Latch Registry.
The table includes each generated sequence, dynamic links to structure files, and the key predicted properties and output files: thermostability values, aggregation scores, and electrostatic and hydrophobic characteristics.
By organizing the data this way, we created a clear, centralized resource that makes it easier to analyze and select the most promising enzyme candidates for further refinement.
To narrow down the enzyme candidates, we applied filters to the table, focusing on those that were thermally stable at 40°C, had an aggregation propensity between -0.5 and 0.2—indicating an optimal balance of aggregation and stability—and were longer than 400 amino acids.
This filtering process helped to efficiently identify the most promising enzymes for further refinement and experimental testing, ensuring that only the best candidates move forward in the development cycle.
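The same filter is easy to express in code if you export the table. The sketch below is not the Latch Registry API; it uses a hypothetical `Candidate` record and treats the aggregation bounds as inclusive, which is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One row of the results table (field names are hypothetical)."""
    name: str
    sequence: str
    stable_at_40c: bool   # binary thermostability call at the 40°C threshold
    median_a3d: float     # median per-residue A3D score


def passes_filters(c: Candidate, a3d_min: float = -0.5, a3d_max: float = 0.2,
                   min_length: int = 400) -> bool:
    """Stable at 40°C, aggregation score within bounds, and longer than min_length."""
    return (
        c.stable_at_40c
        and a3d_min <= c.median_a3d <= a3d_max
        and len(c.sequence) > min_length
    )


def shortlist(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that pass every filter."""
    return [c for c in candidates if passes_filters(c)]
```

Keeping the thresholds as parameters makes it straightforward to loosen or tighten the filter as experimental results come in.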
Designing De-Novo Binders for PCSK9
In this second flow, we turn to drug development and use our toolkit to build a protein binder. Our mock target is PCSK9, a protein involved in cardiovascular disease and of active interest in several modern drug programs. PCSK9 binds to LDL receptors (LDL-R) on liver cells, causing their degradation and reducing the liver's ability to remove LDL cholesterol from the bloodstream. A PCSK9 binder prevents this interaction, allowing more LDL receptors to be recycled to the cell surface and potentially lowering blood cholesterol levels. [1]
Flow Anatomy
To arrive at candidate binders, we will start with scaffold design and end up generating 100 potential binder structures in the following steps:
Use PyMOL to identify the region of PCSK9 we want to bind
Diffuse 10 scaffolds near binding hotspots on PCSK9 with RFDiffusion
Use ProteinMPNN to generate 100 sequences for these scaffolds
Predict 3D protein structure for each sequence using ColabFold and explore the structure of the binders in a complex with PCSK9
1. Identifying the binding region
Let’s take a look at PCSK9. We first navigate to the RCSB Protein Data Bank and download a PDB file of 2W2M, which stores the structure of a complex of PCSK9 bound to LDL-R. [7]
Opening this up in PyMOL (a protein visualization tool) we see two chains representing PCSK9 (Chain A - dark blue, Chain P - light blue) and one representing LDL-R (Chain E - gray).
We will now highlight the binding regions on PCSK9 Chain A where LDL-R binds. Our hypothesis is that if we can design a binder to bind to these hotspot regions, we can prevent the interaction of PCSK9 and LDL-R.
From a literature search, we identified the following amino acids as critical hotspot residues in the 370-385 region of PCSK9 Chain A. [7]
Asp374 (red)
Thr377 (yellow)
Phe379 (green)
Below, you can see these regions highlighted:
2. Designing a binder scaffold with RFDiffusion
Now that we’ve identified our target region and hotspots, RFDiffusion can be used to create a structure to scaffold this location. RFDiffusion is a protein design tool that uses diffusion models to generate novel protein structures. Using RFDiffusion, we can use the PCSK9 protein structure as a template and diffuse a structure around our hotspot region. This will serve as the backbone of our binder designs.
Using the 2W2M PDB file above as our input structure file, we can generate potential binder scaffolds on Latch using parameters in the RFDiffusion workflow to specify the regions of PCSK9 that we want to use.
These parameters launched a protein design workflow which created 10 potential binder structures, each up to 100 residues long, that were designed using Chain A (residues 370-395) of PCSK9 as a template. [8]
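For readers running RFDiffusion directly rather than through the Latch workflow, the sketch below assembles the equivalent command-line overrides, following the pattern of the `design_ppi.sh` example in the RFDiffusion repository. The target range, hotspots, maximum binder length and number of designs come from the text above; the minimum binder length, the file names, and the helper function itself are assumptions for illustration.

```python
from typing import List, Sequence


def rfdiffusion_args(input_pdb: str, output_prefix: str, target_chain: str,
                     target_start: int, target_end: int,
                     binder_min_len: int, binder_max_len: int,
                     hotspots: Sequence[int], num_designs: int) -> List[str]:
    """Build override arguments for RFDiffusion's scripts/run_inference.py."""
    # "A370-395/0 70-100": keep target residues, chain break, then a new
    # binder chain whose length is sampled from the given range.
    contig = (f"{target_chain}{target_start}-{target_end}/0 "
              f"{binder_min_len}-{binder_max_len}")
    hotspot_str = ",".join(f"{target_chain}{r}" for r in hotspots)
    return [
        f"inference.input_pdb={input_pdb}",
        f"inference.output_prefix={output_prefix}",
        f"contigmap.contigs=[{contig}]",
        f"ppi.hotspot_res=[{hotspot_str}]",
        f"inference.num_designs={num_designs}",
    ]


# Values from this post, except the minimum length of 70 (an assumption).
args = rfdiffusion_args("2w2m.pdb", "outputs/PCSK9_binder", "A", 370, 395,
                        70, 100, [374, 377, 379], 10)
```

These strings would be appended to a call such as `python scripts/run_inference.py ...`, for example via `subprocess.run`, in an environment where RFDiffusion and its model weights are installed.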
Let’s take a look at one of the structures generated by RFDiffusion: PCSK9_binder_3.pdb. Two chains were generated: 1) Chain A, in purple, is the diffused scaffold and 2) Chain B, in blue, is the region from PCSK9 that the scaffold was designed against. Essentially, Chain A is the generated binder structure.
Here, we can see how it aligns to the original PCSK9 A Chain, with and without hotspots annotated:
There are plenty of parameters that we didn’t explore here from RFDiffusion that allow you to diffuse far more complex scaffolds than this. [9]
3. Sequence generation with ProteinMPNN
Now that we have binder structures, the next step is to generate an amino acid sequence that folds to this backbone structure. ProteinMPNN is a powerful tool that does exactly this - given a backbone structure, it generates protein sequences.
We use the design from the previous section, PCSK9_binder_3, for the rest of this analysis. We feed the PDB file generated by RFDiffusion to ProteinMPNN on Latch and generate 100 sequences for the binder chain.
Within moments, we generate a FASTA file containing 100 sequences that could plausibly fold to our binder.
ProteinMPNN also reports a set of metrics for each generated sequence. The score is the average negative log-probability of the amino acids sampled at the designed positions, so lower values mean the model is more confident in its choices. The global score is the same average taken over all residues in all chains; a low global score suggests the full sequence is a good fit for the backbone, which is a useful proxy for foldability rather than a guarantee of stability or function [10] [11].
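To pick sequences to carry forward, the scores can be read straight out of the FASTA headers. This sketch assumes the header layout written by the open-source ProteinMPNN scripts (comma-separated `key=value` pairs such as `sample=1, score=0.9, global_score=1.1`, with the first record describing the input structure) and one sequence per line; the Latch output may differ, and the helper names are hypothetical.

```python
import re
from typing import List, NamedTuple


class Design(NamedTuple):
    sample: int
    score: float
    global_score: float
    sequence: str


# \b keeps "score=" from matching inside "global_score=".
_SAMPLE = re.compile(r"\bsample=(\d+)")
_SCORE = re.compile(r"\bscore=([0-9.]+)")
_GLOBAL = re.compile(r"\bglobal_score=([0-9.]+)")


def parse_designs(fasta_text: str) -> List[Design]:
    """Parse sampled designs, skipping records without a sample number."""
    lines = [ln.strip() for ln in fasta_text.splitlines() if ln.strip()]
    designs = []
    # Assumes alternating header / sequence lines.
    for header, seq in zip(lines[0::2], lines[1::2]):
        sample = _SAMPLE.search(header)
        score = _SCORE.search(header)
        global_score = _GLOBAL.search(header)
        if not (sample and score and global_score):
            continue  # e.g. the first record, which describes the input backbone
        designs.append(Design(int(sample.group(1)), float(score.group(1)),
                              float(global_score.group(1)), seq))
    return designs


def rank_by_global_score(designs: List[Design]) -> List[Design]:
    """Lower global score first."""
    return sorted(designs, key=lambda d: d.global_score)
```

Ranking by `score` instead of `global_score` is equally simple and may be preferable when only the binder chain is being designed.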
4. Predicting binder protein structure with ColabFold
The final step is to predict the structure of each generated sequence in a complex with the PCSK9 A Chain and inspect the interface between them. ColabFold combines the fast homology search of MMseqs2 with AlphaFold2 to predict protein structures and complexes [12]. It’s worth noting that there are multiple structure prediction models, each with its own strengths. AlphaFold2 has been shown to be one of the most accurate models for binder design, especially with its initial guess support [4], and has been used in conjunction with all-atom versions of these tools (LigandMPNN, RFDiffusionAA) to design heme-binding proteins [13, 14]. ColabFold is fast, predicting structures within minutes, so we opted for it here.
On Latch, we first concatenated the original PCSK9 A Chain sequence with each sequence generated by ProteinMPNN to form a series of multimer sequences.
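Outside of Latch, the same multimer inputs can be built with a few lines of Python. ColabFold treats a colon between sequences in a single FASTA record as a chain break, so each record below describes a two-chain complex. Putting the target chain first is an arbitrary choice for this sketch, and the function and record names are hypothetical.

```python
from typing import Dict


def complex_fasta(target_seq: str, binders: Dict[str, str]) -> str:
    """Write one FASTA record per binder, joined to the target with ':'."""
    records = [
        f">{name}\n{target_seq}:{binder_seq}"
        for name, binder_seq in binders.items()
    ]
    return "\n".join(records) + "\n"
```

For example, `complex_fasta(pcsk9_chain_a, {"binder_3_seq_24": seq})` yields a record that can be passed to ColabFold as a complex prediction job.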
With the sequences ready, we can launch them all through ColabFold and finally we have a de-novo binder structure. This protein does not exist in nature and was built entirely on computers.
For each of these predicted structures, ColabFold also outputs a number of metrics that can be used to evaluate promising structures. pLDDT (predicted Local Distance Difference Test) is reported for every residue in the output structure and measures the confidence in the local structure prediction around that residue. To evaluate complexes, we can look at ipTM (interface predicted TM-score) values, which assess the quality of the predicted interface between chains in a complex.
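A quick way to compare many predictions is to reduce each one to a couple of numbers. The sketch below assumes a per-model scores file in JSON with a `plddt` list and `ptm` / `iptm` values; those key names are our assumption about the output layout, so check them against your own files before relying on this.

```python
import json
from typing import Dict, Optional


def summarize_scores(json_text: str) -> Dict[str, Optional[float]]:
    """Reduce one prediction's scores to mean pLDDT, pTM and ipTM.

    Assumes keys 'plddt' (per-residue list), 'ptm' and 'iptm';
    missing pTM/ipTM values are returned as None.
    """
    scores = json.loads(json_text)
    plddt = scores["plddt"]
    return {
        "mean_plddt": sum(plddt) / len(plddt),
        "ptm": scores.get("ptm"),
        "iptm": scores.get("iptm"),
    }
```

Sorting candidates by ipTM and then inspecting the per-residue pLDDT of the binder chain is one reasonable triage order, but neither metric replaces a look at the predicted interface itself.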
We could also take a look at some of the PAE plots to explore relative confidence [15]. Below is one for the structure of sequence #24 generated by ProteinMPNN.
To Conclude
While these exercises are demonstrations, and not novel science, they highlight a realistic use of machine learning to develop molecular tools and drugs. Models like RFDiffusion, ProteinMPNN, and OmegaFold, and their developer communities, are transforming the drug discovery process. We were able to use the toolkit to build a library of protein “ideas” from custom datasets, along with sequences, structures, and physical/chemical properties, within a couple of hours for two very different tasks.
However, this is only the first part of a larger iterative process. Rational manual inspection followed by experimental validation in the wet lab is crucial to verify the predictions and determine whether these proteins perform as intended. Ultimately, we acknowledge the combination of computational insights with hands-on validation in the wet lab is what will drive meaningful advancements, bridging the gap between innovative protein design and real-world applications.
This project also highlights the power, scalability and flexibility of the suite of graphical protein tools on Latch. These managed tools allow scientists to generate, model, and evaluate custom libraries of proteins to quickly test hypotheses and design molecules.
—
This work was spearheaded by Tahir DMello and Bronte Kolar, bioinformatics leads at LatchBio. Their intelligence, hard work and passion for science are contagious.
Want to work with them on your next protein engineering project? Book a Demo.
Citations
Designing De-Novo Binders for PCSK9
Dauparas, Justas, et al. "Robust deep learning–based protein sequence design using ProteinMPNN." Science 378.6615 (2022): 49-56.
Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.
Watson, Joseph L., et al. "De novo design of protein structure and function with RFdiffusion." Nature 620.7976 (2023): 1089-1100.
https://meilerlab.org/wp-content/uploads/2024/07/rfdiffusion.pdf
https://github.com/RosettaCommons/RFdiffusion/blob/main/examples/design_ppi.sh
https://meilerlab.org/wp-content/uploads/2022/12/protein_mpnn_tutorial_Nov2022.pdf
https://github.com/dauparas/ProteinMPNN/blob/main/examples/submit_example_3.sh
Mirdita, Milot, et al. "ColabFold: making protein folding accessible to all." Nature methods 19.6 (2022): 679-682.
Krishna, Rohith, et al. "Generalized biomolecular modeling and design with RoseTTAFold All-Atom." Science 384.6693 (2024): eadl2528.
Engineering Plastic Degrading Enzymes
Page MM, Watts GF. PCSK9 inhibitors - mechanisms of action. Aust Prescr. 2016 Oct;39(5):164-167. doi: 10.18773/austprescr.2016.060. Epub 2016 Oct 1. PMID: 27789927; PMCID: PMC5079795.
Zhang K., Hamidian A.H., Tubić A., Zhang Y., Fang J.K.H., Wu C., Lam P.K.S. Understanding plastic degradation and microplastic formation in the environment: A review. Environ. Pollut. 2021;274:116554. doi: 10.1016/j.envpol.2021.116554.
Magalhães S., Alves L., Medronho B., Romano A., Rasteiro M.D.G. Microplastics in Ecosystems: From Current Trends to Bio-Based Removal Strategies. Molecules. 2020;25:3954. doi: 10.3390/molecules25173954.
Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun 13, 4348 (2022). https://doi.org/10.1038/s41467-022-32007-7
Chang A., Jeske L., Ulbrich S., Hofmann J., Koblitz J., Schomburg I., Neumann-Schaal M., Jahn D., Schomburg D. BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic Acids Res. 2021;49:D498-D508. doi: 10.1093/nar/gkaa1025. PMID: 33211880.
The UniProt Consortium, UniProt: the Universal Protein Knowledgebase in 2023, Nucleic Acids Research, Volume 51, Issue D1, 6 January 2023, Pages D523–D531, https://doi.org/10.1093/nar/gkac1052
Ruidong Wu, Fan Ding, Rui Wang, Rui Shen, Xiwen Zhang, Shitong Luo, Chenpeng Su, Zuofan Wu, Qi Xie, Bonnie Berger, Jianzhu Ma and Jian Peng. High-resolution de novo structure prediction from primary sequence. bioRxiv preprint, posted July 22, 2022. https://doi.org/10.1101/2022.07.21.500999
TemStaPro: protein thermostability prediction using sequence representations from protein language models. Ieva Pudžiuvelytė, Kliment Olechnovič, Egle Godliauskaite, Kristupas Sermokas, Tomas Urbaitis, Giedrius Gasiunas, Darius Kazlauskas Bioinformatics, Volume 40, Issue 4, April 2024, btae157, https://doi.org/10.1093/bioinformatics/btae157
Aleksander Kuriata, Valentin Iglesias, Jordi Pujols, Mateusz Kurcinski, Sebastian Kmiecik, Salvador Ventura. Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility. Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W300–W307, https://doi.org/10.1093/nar/gkz321
Valentin J. Hoerschinger, Franz Waibl, Nancy D. Pomarici, Johannes R. Loeffler, Charlotte M. Deane, Guy Georges, Hubert Kettenberger, Monica L. Fernández-Quintero, and Klaus R. Liedl. PEP-Patch: Electrostatics in Protein–Protein Recognition, Specificity, and Antibody Developability. Journal of Chemical Information and Modeling 2023, 63 (22), 6964-6971. DOI: 10.1021/acs.jcim.3c01490