Systems and methods for rational protein engineering with deep representation learning
Inventors
Khimulya, Grigory • Alley, Ethan • Biswas, Surojit
Assignees
Abstract
A dataset describing a collection of proteins is loaded; for each protein, the dataset identifies a respective value of a characteristic of interest. The dataset is provided as one or more inputs to a trained unsupervised representation model, causing the model to generate a representation for each protein in the collection. The representation for each protein is input into a supervised top model to train it to obtain a predicted characteristic, and the trained supervised top model is then used to obtain a predicted characteristic for a particular protein.
Core Innovation
The invention uses a dataset describing a collection of proteins, where the dataset further comprises, for each protein, a respective value of a characteristic of interest. The dataset is provided as one or more inputs to a trained unsupervised representation model to generate a representation for each protein in the collection, and the trained unsupervised representation model is fine-tuned on related subsets of protein sequences.
The generated representation for each protein is input into a supervised top model. The supervised top model is trained to obtain a predicted characteristic, and the trained supervised top model is used to obtain a predicted characteristic for a particular protein. The framework is implemented as a method, a non-transitory machine-readable storage medium, and a system.
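The two-stage flow described above (representation model, then supervised top model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `represent` is a hypothetical stand-in that returns amino-acid composition frequencies in place of a trained, fine-tuned unsupervised representation model, the top model is assumed to be a plain linear regressor fitted by gradient descent, and the sequences and values are invented toy data.

```python
from typing import List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def represent(sequence: str) -> List[float]:
    """HYPOTHETICAL stand-in for the trained unsupervised representation model.

    Returns a 20-dimensional amino-acid composition vector. A real system
    would return a learned representation here instead.
    """
    n = max(len(sequence), 1)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]


def train_top_model(
    reps: Sequence[Sequence[float]],
    values: Sequence[float],
    lr: float = 0.3,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit a linear top model (assumed form) by batch gradient descent on MSE."""
    dim, n = len(reps[0]), len(reps)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(reps, values):
            err = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(model: Tuple[List[float], float], rep: Sequence[float]) -> float:
    """Apply the trained top model to one protein representation."""
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, rep)) + bias


# Toy dataset (invented for illustration): sequence, value of the characteristic.
dataset = [("AAAAAA", 0.0), ("AAAAKK", 1.0), ("AAKKKK", 2.0), ("KKKKKK", 3.0)]

# Step 1: generate a representation for each protein in the collection.
reps = [represent(seq) for seq, _ in dataset]
values = [value for _, value in dataset]

# Step 2: train the supervised top model on the representations.
top_model = train_top_model(reps, values)

# Step 3: obtain a predicted characteristic for a particular protein.
predicted = predict(top_model, represent("AKAKAK"))
print(f"predicted characteristic: {predicted:.3f}")
```

Keeping the representation step and the top model as separate functions mirrors the separation in the text: the representation model can be swapped or fine-tuned without changing how the top model is trained.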
Claims Coverage
The independent claims cover a protein-characteristic prediction framework that uses a trained unsupervised representation model fine-tuned on related subsets of protein sequences and a supervised top model trained on those representations to obtain a predicted characteristic for a particular protein. Parallel coverage is provided for method, non-transitory machine-readable storage medium, and system claims.
Protein dataset to fine-tuned unsupervised representation and supervised-top prediction
A method comprising loading a dataset describing a collection of proteins with a respective value of a characteristic of interest; providing the dataset as one or more inputs to a trained unsupervised representation model to generate a representation for each protein in the collection, where the trained unsupervised representation model is fine-tuned on related subsets of protein sequences; inputting the representation for each protein into a supervised top model to train the supervised top model to obtain a predicted characteristic; and using the trained supervised top model to obtain a predicted characteristic for a particular protein.
Machine-readable storage medium executing fine-tuned unsupervised representation and supervised-top prediction
At least one non-transitory machine-readable storage medium with instructions stored thereon, executable by a machine to cause the machine to load a dataset describing a collection of proteins with a respective value of a characteristic of interest; provide the dataset as one or more inputs to a trained unsupervised representation model to generate a representation for each protein in the collection, where the trained unsupervised representation model is fine-tuned on related subsets of protein sequences; input the representation for each protein into a supervised top model to train the supervised top model to obtain a predicted characteristic; and use the trained supervised top model to obtain a predicted characteristic for a particular protein.
System with trained unsupervised representation model fine-tuned on related subsets and supervised-top prediction
A system comprising a processor device, a memory, a trained unsupervised representation model, a supervised top model, and logic executable by the processor device to load a dataset describing a collection of proteins with a respective value of a characteristic of interest; provide the dataset as one or more inputs to the trained unsupervised representation model to generate a representation for each protein in the collection, where the trained unsupervised representation model is fine-tuned on related subsets of protein sequences; input the representation for each protein into the supervised top model to train the supervised top model to obtain a predicted characteristic; and use the trained supervised top model to obtain a predicted characteristic for a particular protein.
The inventive concept is the coupling of a trained unsupervised representation model fine-tuned on related subsets of protein sequences with a supervised top model trained to predict a characteristic, and then applying the trained supervised top model to obtain a predicted characteristic for a particular protein.
Stated Advantages
Improved average prediction performance (e.g., ~2×).
Ability to perform engineering/optimization and sequence prioritization over large protein libraries at lower cost (e.g., ~100× lower than Doc2Vec).
Homology detection outperforming PSI-BLAST.
Documented Applications
Predicting a characteristic of a particular protein using the trained supervised top model.
Using a protein sequence library to predict a value of a characteristic or function of interest and prioritizing proteins from the library based on predicted values.
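The library prioritization application can be sketched as a simple ranking step. This is a hedged illustration only: `prioritize` and `toy_predictor` are hypothetical names, and `toy_predictor` (fraction of lysine residues) merely stands in for a trained supervised top model applied to each sequence's representation.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def prioritize(
    library: Iterable[str],
    predict_value: Callable[[str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Return the top_k sequences with the highest predicted values, best first."""
    scored = ((seq, predict_value(seq)) for seq in library)
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


def toy_predictor(sequence: str) -> float:
    """HYPOTHETICAL stand-in for the trained top model: fraction of 'K' residues."""
    return sequence.count("K") / len(sequence)


# Invented toy library; a real library would hold candidate protein sequences.
library = ["AAAA", "AKAK", "KKKK", "AAAK"]
shortlist = prioritize(library, toy_predictor, top_k=2)
print(shortlist)
```

Scoring lazily with a generator and selecting with `heapq.nlargest` avoids sorting the whole library, which matters when only a small shortlist is carried forward from a large library.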