Systems and methods for rational protein engineering with deep representation learning

Inventors

Khimulya, Grigory; Alley, Ethan; Biswas, Surojit

Assignees

ALLEY, ETHAN CHASE; Nabla Bio Inc

Publication Number

US-12040050-B1

Patent

Publication Date

2024-07-16

Abstract

A dataset describing a collection of proteins is loaded, which identifies, for each protein, a respective value of a characteristic of interest. The dataset is provided as one or more inputs to a trained unsupervised representation model to cause the trained unsupervised representation model to generate a representation for each protein in the collection. The representation for each protein is input into a supervised top model to train the supervised top model to obtain a predicted characteristic and the trained supervised top model is used to obtain a predicted characteristic for a particular protein.

Core Innovation

The invention uses a dataset describing a collection of proteins, where the dataset further comprises, for each protein, a respective value of a characteristic of interest. The dataset is provided as one or more inputs to a trained unsupervised representation model to generate a representation for each protein in the collection, and the trained unsupervised representation model is fine-tuned on related subsets of protein sequences.

The generated representation for each protein is input into a supervised top model. The supervised top model is trained to obtain a predicted characteristic, and the trained supervised top model is used to obtain a predicted characteristic for a particular protein. The framework is implemented as a method, a non-transitory machine-readable storage medium, and a system.
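The flow described above (dataset, representation, supervised top model, prediction) can be illustrated with a minimal sketch. It is not the patented implementation: `embed` is a hypothetical stand-in that returns amino-acid composition instead of the output of a trained, fine-tuned unsupervised representation model, the top model is assumed to be a simple linear regressor fitted by gradient descent, and the sequences and values are made-up toy data.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Hypothetical stand-in for the trained unsupervised representation model.

    Returns the 20 amino-acid fractions of the sequence. A real system would
    return a learned representation from the fine-tuned model instead.
    """
    n = max(len(sequence), 1)
    return [sequence.count(a) / n for a in AMINO_ACIDS]


def train_top_model(representations, values, lr=0.5, epochs=2000):
    """Fit a linear top model (weights, bias) by batch gradient descent
    on mean squared error. The choice of a linear model is an assumption."""
    dim = len(representations[0])
    m = len(representations)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(representations, values):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def predict(model, sequence):
    """Predicted characteristic for a particular protein sequence."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, embed(sequence))) + b


# Toy dataset: (sequence, value of the characteristic of interest).
# The values are invented for illustration (here, the lysine fraction).
dataset = [
    ("KKKKAAAA", 0.50),
    ("KKAAAAAA", 0.25),
    ("AAAAAAAA", 0.00),
    ("KKKKKKAA", 0.75),
    ("GGKKGGKK", 0.50),
    ("GGGGGGGG", 0.00),
]

# Step 1: generate a representation for each protein in the collection.
representations = [embed(seq) for seq, _ in dataset]
values = [value for _, value in dataset]

# Step 2: train the supervised top model on those representations.
model = train_top_model(representations, values)

# Step 3: obtain a predicted characteristic for a particular protein.
print(predict(model, "KKKKKKKK"))
```

Keeping the representation step and the top model separate mirrors the description: the representation model is trained without labels, and only the small top model is fitted to the labeled characteristic values.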

Claims Coverage

The independent claims cover a protein-characteristic prediction framework that uses a trained unsupervised representation model fine-tuned on related subsets of protein sequences and a supervised top model trained on those representations to obtain a predicted characteristic for a particular protein. Parallel coverage is provided for method, non-transitory machine-readable storage medium, and system claims.

Protein dataset to fine-tuned unsupervised representation and supervised-top prediction

A method comprising loading a dataset describing a collection of proteins with a respective value of a characteristic of interest; providing the dataset as one or more inputs to a trained unsupervised representation model to generate a representation for each protein in the collection, where the trained unsupervised representation model is fine-tuned on related subsets of protein sequences; inputting the representation for each protein into a supervised top model to train the supervised top model to obtain a predicted characteristic; and using the trained supervised top model to obtain a predicted characteristic for a particular protein.

Machine-readable storage medium executing fine-tuned unsupervised representation and supervised-top prediction

At least one non-transitory machine-readable storage medium with instructions stored thereon, executable by a machine to cause the machine to load a dataset describing a collection of proteins with a respective value of a characteristic of interest; provide the dataset as one or more inputs to a trained unsupervised representation model to generate a representation for each protein in the collection, where the trained unsupervised representation model is fine-tuned on related subsets of protein sequences; input the representation for each protein into a supervised top model to train the supervised top model to obtain a predicted characteristic; and use the trained supervised top model to obtain a predicted characteristic for a particular protein.

System with trained unsupervised representation model fine-tuned on related subsets and supervised-top prediction

A system comprising a processor device, a memory, a trained unsupervised representation model, a supervised top model, and logic executable by the processor device to load a dataset describing a collection of proteins with a respective value of a characteristic of interest; provide the dataset as one or more inputs to the trained unsupervised representation model to generate a representation for each protein in the collection, where the trained unsupervised representation model is fine-tuned on related subsets of protein sequences; input the representation for each protein into the supervised top model to train the supervised top model to obtain a predicted characteristic; and use the trained supervised top model to obtain a predicted characteristic for a particular protein.

The inventive concept is the coupling of a trained unsupervised representation model fine-tuned on related subsets of protein sequences with a supervised top model trained to predict a characteristic, and then applying the trained supervised top model to obtain a predicted characteristic for a particular protein.

Stated Advantages

Improved average prediction performance (e.g., ~2×).

Ability to perform engineering/optimization and sequence prioritization over large protein libraries at lower cost (e.g., ~100× lower than Doc2Vec).

Homology detection outperforming PSI-BLAST.

Documented Applications

Predicting a characteristic of a particular protein using the trained supervised top model.

Using a protein sequence library to predict a value of a characteristic or function of interest and prioritizing proteins from the library based on predicted values.
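Library prioritization of this kind can be sketched as scoring every sequence and keeping the highest-scoring ones. The snippet below is an illustrative assumption, not the documented procedure: `prioritize` and `toy_predictor` are hypothetical names, and `toy_predictor` (lysine fraction) merely stands in for a trained supervised top model applied to learned representations.

```python
import heapq


def prioritize(library, predict_value, top_k=3):
    """Return the top_k (sequence, predicted value) pairs,
    ordered from highest to lowest predicted value."""
    scored = ((seq, predict_value(seq)) for seq in library)
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


def toy_predictor(sequence):
    """Hypothetical placeholder for the trained supervised top model:
    scores a sequence by its fraction of lysine (K)."""
    return sequence.count("K") / max(len(sequence), 1)


# Toy library of candidate sequences (invented for illustration).
library = ["AAAAAAAA", "KKAAAAAA", "KKKKKKAA", "GGGKGGGG", "KKKKAAAA"]

ranked = prioritize(library, toy_predictor, top_k=2)
for seq, score in ranked:
    print(seq, score)
```

Because `heapq.nlargest` only tracks the current best `top_k` entries and the scores are produced by a generator, the full scored library never has to be held in memory, which matters when the library is large.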
