Function guided in silico protein design
Inventors
Gligorijevic, Vladimir • Bonneau, Richard • CHO, KyungHyun
Assignees
Simons Foundation • Genentech Inc • New York University NYU
Publication Number
US-12131801-B2
Publication Date
2024-10-29
Expiration Date
2042-05-16
Abstract
A protein design system includes one or more processors configured to modify, by a modifier, an input sequence corresponding to a protein, the input sequence comprising a data structure indicating a plurality of amino acid residues of the protein; map, by an encoder, the modified sequence to a latent space; predict, by a length predictor, a length difference between the mapped sequence and a target sequence based on at least one target function of the target sequence; identify, by a function classifier, at least one sequence function of the modified sequence; transform, by a length transformer, the modified sequence based on the length difference and the at least one sequence function; and generate, by a decoder, a candidate for the target sequence based on the transformed sequence.
Core Innovation
The invention provides a protein design system utilizing a coupled set of machine learning models, including an autoencoder, a length prediction model, a length transformation model, and a function classification model. This system operates by modifying an input protein sequence (such as by insertion, deletion, or modification of amino acid residues), then encoding this modified sequence into a latent space where structural and functional relationships between protein sequences are preserved. The encoder is trained so that its output reflects the relationship of the sequence to known populations of protein sequences that share structural and/or functional similarities.
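The modification step described above (insertion, deletion, or substitution of residues) can be illustrated with a minimal sketch. The function name `modify_sequence`, the uniform random choice of edit type, and the `n_edits` parameter are assumptions made for illustration; the patent summary does not state how the modifier selects its edits.

```python
import random

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def modify_sequence(seq, rng, n_edits=1):
    """Apply n_edits random edits to a non-empty residue string.

    Hypothetical helper: each edit is an insertion, a deletion, or a
    substitution, chosen uniformly at random (an assumption).
    """
    residues = list(seq)
    for _ in range(n_edits):
        op = rng.choice(["insert", "delete", "substitute"])
        if op == "insert":
            pos = rng.randrange(len(residues) + 1)
            residues.insert(pos, rng.choice(AMINO_ACIDS))
        elif op == "delete" and len(residues) > 1:
            del residues[rng.randrange(len(residues))]
        else:
            # Substitution; also the fallback when a deletion would
            # leave the sequence empty.
            pos = rng.randrange(len(residues))
            choices = [a for a in AMINO_ACIDS if a != residues[pos]]
            residues[pos] = rng.choice(choices)
    return "".join(residues)
```

Passing a seeded `random.Random` instance makes the edits reproducible, which is convenient when comparing candidates generated from the same input sequence.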
A key feature is the prediction and transformation of sequence length, enabling the design of protein candidates of variable length, which traditional fixed-backbone approaches struggle with. The system predicts a length difference to a target sequence (associated with at least one target function), transforms the latent representation accordingly, and then evaluates the resulting designed sequence for target functionality using a function classification model. Sequences meeting predefined functional thresholds are decoded to produce candidate protein sequences in a data structure indicating their amino acid residue composition.
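To make the idea of changing the length of a latent representation concrete, the sketch below resamples a per-position latent (one vector per residue) to a new length by linear interpolation. This is a simple stand-in, not the patented method: the patent describes a learned length transformation model, and both the per-position latent layout and the name `resample_latent` are assumptions.

```python
def resample_latent(latent, length_diff):
    """Resample a per-position latent to len(latent) + length_diff rows.

    `latent` is a list of equal-length float vectors, one per residue
    (an assumed layout). Rows are linearly interpolated, standing in
    for the learned length transformation model.
    """
    old_len = len(latent)
    new_len = old_len + length_diff
    if old_len == 0 or new_len < 1:
        raise ValueError("latent and result must each have at least one row")
    if old_len == 1 or new_len == 1:
        # Nothing to interpolate between: repeat the first row.
        return [list(latent[0]) for _ in range(new_len)]
    out = []
    for i in range(new_len):
        # Map output index i onto the original index range [0, old_len - 1].
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(latent[lo], latent[hi])])
    return out
```

A positive `length_diff` lengthens the representation and a negative one shortens it, so the same routine covers both growing and shrinking a candidate before decoding.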
This approach addresses the problem of exploring the vast and sparsely functional protein sequence space for de novo protein design. Conventional methods rely on fixed backbone architectures and can be computationally intensive while often failing to incorporate explicit functional guidance. By directly incorporating functional classification and probabilistic sampling within a learned manifold of protein sequence space, the system offers a computationally efficient pathway for generating novel protein sequences with specified functional attributes, aiding applications such as large molecule drug discovery.
Claims Coverage
The claims cover three independent inventive features: a protein design system, a computer-implemented method, and a non-transitory computer-readable medium storing instructions for sequence generation.
Protein design system using coupled machine learning models for sequence modification, latent space embedding, and function-guided design
The system comprises:
- An autoencoder coupled with a length prediction machine learning model, a length transformation machine learning model, and a function classification machine learning model.
- Operations executed by one or more processors, which include:
  - Modifying an input sequence comprising amino acid residues (including possible insertion, deletion, or substitution of residues).
  - Using an encoder to generate a latent space sequence representation, wherein the latent space position indicates relationships to populations of sequences with structural/functional similarity.
  - Predicting a length difference between the latent representation and a target sequence based on at least one function of the target sequence.
  - Applying the length difference via a length transformation model to produce a length-transformed latent representation.
  - Identifying at least one sequence function of the transformed representation with a function classification model.
  - Upon meeting one or more functional thresholds, generating the target sequence by decoding the transformed latent representation to obtain a data structure of amino acid residues.
Computer-implemented method for generating a target protein sequence using function-guided sequence modification and transformation
The method involves:
- Modifying an input protein sequence as a data structure indicating amino acid residues.
- Generating a latent space sequence representation using an autoencoder-based encoder.
- Predicting a length difference between this representation and a target sequence, based on at least one target function.
- Applying a length transformation based on the predicted difference.
- Using a function classifier to identify at least one sequence function in the transformed latent representation.
- If the function(s) satisfy thresholds, generating the target sequence by decoding the transformed representation into a data structure indicating the respective amino acid residues.
Non-transitory computer-readable medium with instructions for automated function-guided protein sequence generation
The medium stores instructions that, when executed, cause a system to:
- Modify an input sequence indicating amino acid residues of a protein.
- Encode the modified input into a latent space representation with an autoencoder.
- Predict a length difference to a target sequence (corresponding to a different protein) with a length prediction model, based on at least one target function.
- Apply transformation to the latent space representation reflecting the length difference.
- Use a function classification model to assess at least one sequence function of the transformed representation.
- Upon meeting predefined thresholds for function, decode the transformed representation to generate a data structure indicating amino acid residues of the target sequence.
The inventive features center on a multi-model approach for protein sequence design that leverages latent space manipulation, explicit length transformation, and function-based sequence evaluation, providing targeted generation of protein sequences with specified attributes.
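The order of operations shared by the three claim groups can be sketched as plain control flow. Everything named here (`DesignModels`, `design_candidate`, the callable signatures, and the single scalar `threshold`) is hypothetical scaffolding; the trained models themselves are not part of this summary and are represented only as pluggable callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Assumed latent layout: one float vector per sequence position.
Latent = List[List[float]]


@dataclass
class DesignModels:
    """Hypothetical container for the coupled models named in the claims."""
    modify: Callable[[str], str]
    encode: Callable[[str], Latent]
    predict_length_diff: Callable[[Latent, str], int]
    transform_length: Callable[[Latent, int], Latent]
    classify: Callable[[Latent], Dict[str, float]]
    decode: Callable[[Latent], str]


def design_candidate(seq: str, target_function: str,
                     models: DesignModels,
                     threshold: float = 0.5) -> Optional[str]:
    """Run one pass of the claimed step order; return None if rejected."""
    modified = models.modify(seq)                       # 1. modify input
    z = models.encode(modified)                         # 2. map to latent space
    delta = models.predict_length_diff(z, target_function)  # 3. length difference
    z_t = models.transform_length(z, delta)             # 4. length transformation
    scores = models.classify(z_t)                       # 5. function classification
    if scores.get(target_function, 0.0) < threshold:
        return None                                     # functional threshold not met
    return models.decode(z_t)                           # 6. decode to residues
```

Keeping each model behind a callable mirrors the claim language, in which decoding happens only after the function classifier's output satisfies the threshold.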
Stated Advantages
Improves computational efficiency for de novo protein design by sampling from a learned data distribution associated with known protein sequences, avoiding exhaustive combinatorial search.
Enables explicit function-guided protein sequence design by integrating a function classifier, which maps structural modifications directly to the functions they alter or confer.
Allows generation of novel protein sequences of variable length, overcoming limitations of fixed-backbone approaches common in traditional protein design techniques.
Enhances the diversity and functionality of generated sequences by conditioning sampling and sequence generation on desired or undesired functions.
Supports the integration of multiple functional objectives or profiles, permitting the design of protein sequences targeting one or several desired functions simultaneously.
Provides capability to redesign proteins to remove undesired functions, not just add or preserve desired functions.
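The advantages concerning multiple objectives and undesired functions amount to an acceptance test over classifier scores. The sketch below shows one way such a test could look; the function name `meets_profile`, the score dictionary format, and the default cutoffs of 0.8 and 0.2 are illustrative assumptions, not values from the patent.

```python
def meets_profile(scores, desired, undesired,
                  min_score=0.8, max_score=0.2):
    """Check classifier scores against a functional profile.

    scores: mapping from function label to predicted probability
            (assumed classifier output format).
    desired: labels that must score at least min_score.
    undesired: labels that must score at most max_score.
    A label missing from scores is treated as probability 0.0.
    """
    has_desired = all(scores.get(f, 0.0) >= min_score for f in desired)
    lacks_undesired = all(scores.get(f, 0.0) <= max_score for f in undesired)
    return has_desired and lacks_undesired
```

For example, a redesign that should keep calcium ion binding while dropping another activity would list the first label under `desired` and the second under `undesired`.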
Documented Applications
Designing de novo protein sequences with specified functions for use in large molecule drug discovery, such as antibodies binding to antigens including viral or tumor antigens.
Accelerating synthetic biology, agriculture, medicine, and nanotechnology research through the development of new enzymes, peptides, and biosensors.
Redesigning protein sequences to attain new functional capabilities (e.g., converting a beta-sheet protein to an alpha-helical transporter protein with ion transmembrane transporter activity).
Recovering or introducing metal binding sites in proteins for applications that require specific binding functionality (e.g., calcium ion binding proteins).
Redesigning enzyme functions, such as modifying cutinases to maintain or alter catalytic residues for targeted enzymatic activity.