Generating machine learning models using genetic data

Inventors

Otte, Gabriel; Roberts, Charles; Drake, Adam; Ennis, Riley Charles

Assignees

Freenome Holdings Inc

Publication Number

US-11514289-B1

Patent

Publication Date

2022-11-29

Expiration Date


Abstract

Systems, methods, and apparatuses are described for generating and using machine learning models based on genetic data. A set of input features for training the machine learning model can be identified and used to train the model based on training samples, e.g., for which one or more labels are known. As examples, the input features can include aligned variables (e.g., derived from sequences aligned to a population level or individual references) and/or non-aligned variables (e.g., sequence content). The features can be classified into different groups based on the underlying genetic data or intermediate values resulting from a processing of the underlying genetic data. Features can be selected from a feature space for creating a feature vector for training a model. The selection and creation of feature vectors can be performed iteratively to train many models as part of a search for optimal features and an optimal model.
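The iterative search over features and models described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the nearest-centroid classifier, the exhaustive enumeration of small feature subsets, the held-out accuracy score, and the names `nearest_centroid_accuracy` and `search_feature_subsets` are hypothetical and are not taken from the patent.

```python
from itertools import combinations


def nearest_centroid_accuracy(train, test, cols):
    """Fit a nearest-centroid classifier on the chosen columns; return test accuracy."""
    centroids = {}
    for label in sorted({y for _, y in train}):
        rows = [[x[c] for c in cols] for x, y in train if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    correct = 0
    for x, y in test:
        v = [x[c] for c in cols]
        # Predict the label whose centroid is closest in squared Euclidean distance.
        pred = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(v, centroids[lab])),
        )
        correct += pred == y
    return correct / len(test)


def search_feature_subsets(train, test, n_features, max_size=2):
    """Train one model per candidate feature subset and keep the best-scoring one."""
    best_cols, best_acc = None, -1.0
    for size in range(1, max_size + 1):
        for cols in combinations(range(n_features), size):
            acc = nearest_centroid_accuracy(train, test, cols)
            if acc > best_acc:  # strict '>' keeps the smallest subset on ties
                best_cols, best_acc = cols, acc
    return best_cols, best_acc


# Toy data (made up): only feature 1 separates the two classes.
train = [
    ([0.5, 0.1, 0.3], 0),
    ([0.4, 0.2, 0.9], 0),
    ([0.5, 0.9, 0.2], 1),
    ([0.4, 0.8, 0.8], 1),
]
test = [([0.45, 0.15, 0.5], 0), ([0.45, 0.85, 0.5], 1)]

best_cols, best_acc = search_feature_subsets(train, test, n_features=3)
print(best_cols, best_acc)
```

On this toy data the search settles on the single informative column. A realistic search would draw candidate features from the groups of aligned and non-aligned variables rather than from arbitrary column indices.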

Core Innovation

The invention describes creating and implementing a machine learning model for performing cancer classifications of biological samples obtained from human subjects. Training samples are received, each including nucleic acids from a corresponding human subject, and nucleic acids are sequenced to obtain sets of sequence reads corresponding to a plurality of chromosomes. Known labels for classification of cancer for each corresponding human subject are obtained and received at a computer system together with the plurality of sets of sequence reads.

For each sample, a set of features is identified for input to the machine learning model, including non-aligned variables comprising statistical measures of occurrence of Kmers of a Kmer database in the set of sequence reads. The sequence reads are analyzed to obtain a vector of values for the set of features (a training vector during training, a feature vector during classification), where each element corresponds to a feature that includes one or more variables. The training process operates on the training vectors using parameters of the machine learning model to obtain output labels and compares the output labels to the known labels.
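As a rough illustration of a non-aligned Kmer feature, the sketch below counts how often each Kmer from a small database occurs in a set of reads, without aligning the reads to any reference. The choice of relative frequency as the "statistical measure of occurrence", the function name `kmer_feature_vector`, and the toy reads are assumptions made for this example, not details from the patent.

```python
from collections import Counter


def kmer_feature_vector(reads, kmer_database, k):
    """Return one relative frequency per database Kmer, in database order."""
    wanted = set(kmer_database)
    counts = Counter()
    total = 0  # number of length-k windows seen across all reads
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            total += 1
            if kmer in wanted:
                counts[kmer] += 1
    # Assumed statistic: fraction of all windows that match each database Kmer.
    return [counts[kmer] / total if total else 0.0 for kmer in kmer_database]


reads = ["ACGTAC", "GTACGT"]          # toy sequence reads (made up)
kmer_database = ["ACG", "GTA", "TTT"]  # toy Kmer database (made up)

vector = kmer_feature_vector(reads, kmer_database, k=3)
print(vector)  # [0.25, 0.25, 0.0]
```

Each read contributes four overlapping 3-mers, so the eight windows contain "ACG" twice, "GTA" twice, and "TTT" never.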

The model is trained by iteratively searching for optimal values of the parameters based on the comparison of output labels to the known labels; the resulting parameters of the machine learning model and the set of features are then provided as output. In related embodiments, the set of features can additionally include aligned variables, including properties of sequence reads aligned to windows of a human reference genome and sequence similarity. Features can also be determined from representations such as Kmer histograms and a Kmer covariance matrix based on distances between pairs of Kmer histograms. The invention also covers applying the trained machine learning model to a sample to obtain and provide an output label comprising a classification of a cancer.
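The distance computation between pairs of Kmer histograms can be sketched as follows. The summary does not say which distance is used or how the distances are turned into a covariance matrix, so this example assumes plain Euclidean distance and stops at the pairwise distance matrix; `histogram_distance_matrix` and the histogram values are hypothetical.

```python
import math


def histogram_distance_matrix(histograms):
    """Return a symmetric matrix of Euclidean distances between Kmer histograms."""
    n = len(histograms)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(histograms[i], histograms[j])  # Python 3.8+
            dist[i][j] = dist[j][i] = d
    return dist


# One toy histogram per sample: counts for the same three Kmers (made up).
histograms = [[3, 0, 4], [0, 0, 0], [3, 0, 4]]

dist = histogram_distance_matrix(histograms)
print(dist[0][1], dist[0][2])  # 5.0 0.0
```

Samples with identical histograms end up at distance zero, which is the property a downstream similarity or covariance-style feature would rely on.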

Claims Coverage

The document includes three independent claims. Across these, the claims center on extracting a feature vector from sequencing reads using non-aligned Kmer-based statistical measures, applying a machine learning model with parameters to output cancer classification labels, and, for training, iteratively comparing output labels to known labels to optimize parameters.

Training a cancer classification machine learning model using non-aligned Kmer occurrence features

Receiving training samples including nucleic acids from human subjects, sequencing nucleic acids to obtain sets of sequence reads corresponding to a plurality of chromosomes, obtaining known labels for classification of cancer for the corresponding human subject, receiving the sets of sequence reads and the known labels at a computer system, identifying a set of features including non-aligned variables comprising statistical measures of occurrence of Kmers of a Kmer database in the set of sequence reads, analyzing the set of sequence reads to obtain a training vector with values of the set of features, operating on the training vectors using parameters of the machine learning model to obtain output labels, comparing the output labels to the known labels, iteratively searching for optimal values of the parameters based on the comparing, and providing the parameters and the set of features.
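The training steps in this claim (operate on training vectors with the current parameters, compare output labels to known labels, iteratively update the parameters, then provide them) can be sketched with a small example. Logistic regression fitted by batch gradient descent is an assumed stand-in chosen for brevity; the claim does not name a model type, and the function names, learning rate, and toy vectors are hypothetical.

```python
import math


def predict_proba(weights, bias, x):
    """Logistic model: probability of label 1 for feature vector x."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(training_vectors, known_labels, epochs=500, lr=0.5):
    """Iteratively adjust parameters to reduce the mismatch with the known labels."""
    n = len(training_vectors[0])
    m = len(training_vectors)
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(training_vectors, known_labels):
            # Compare the model output with the known label.
            err = predict_proba(weights, bias, x) - y
            for j in range(n):
                grad_w[j] += err * x[j]
            grad_b += err
        # Gradient step on the log-loss: move the parameters against the error.
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias  # the "provided" parameters


# Toy training vectors (e.g. two Kmer frequencies per sample) and known labels.
training_vectors = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
known_labels = [0, 0, 1, 1]

weights, bias = train(training_vectors, known_labels)
output_labels = [int(predict_proba(weights, bias, x) >= 0.5) for x in training_vectors]
print(output_labels)
```

In practice the comparison would be made on held-out samples as well, since agreement on the training vectors alone says little about how the parameters generalize.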

Implementing a cancer classification machine learning model using a non-aligned Kmer feature vector

Sequencing nucleic acids of a human subject to obtain a set of sequence reads corresponding to a plurality of chromosomes, storing definitions of a set of features including non-aligned variables comprising statistical measures of occurrence of Kmers of a Kmer database in the set of sequence reads, analyzing the set of sequence reads to obtain a feature vector comprising values of the set of features, operating on the feature vector using parameters of the machine learning model to obtain an output label comprising a classification of a cancer for the sample, and providing the output label.
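The classification steps in this claim can be sketched end to end: stored feature definitions and stored parameters are loaded, a feature vector is computed from the reads, and an output label is produced. The JSON layout, the linear logistic model, the parameter values, the label names, and the function `classify` are all hypothetical placeholders, not details from the patent.

```python
import json
import math

# Hypothetical stored artifact: feature definitions plus model parameters.
MODEL_JSON = """
{
  "k": 2,
  "kmers": ["AC", "GT"],
  "weights": [4.0, -4.0],
  "bias": 0.0,
  "labels": ["negative", "positive"]
}
"""


def classify(reads, model):
    """Compute the stored Kmer features for the reads and return an output label."""
    k = model["k"]
    windows = [r[i:i + k] for r in reads for i in range(len(r) - k + 1)]
    total = len(windows) or 1  # avoid division by zero when there are no windows
    # Feature vector: relative frequency of each stored Kmer (assumed statistic).
    features = [windows.count(kmer) / total for kmer in model["kmers"]]
    z = model["bias"] + sum(w * f for w, f in zip(model["weights"], features))
    probability = 1.0 / (1.0 + math.exp(-z))
    return model["labels"][int(probability >= 0.5)]


model = json.loads(MODEL_JSON)
label_a = classify(["ACAC", "ACGA"], model)  # reads rich in "AC"
label_b = classify(["GTGT"], model)          # reads rich in "GT"
print(label_a, label_b)  # positive negative
```

Keeping the feature definitions next to the parameters means the same features are computed at classification time as were used in training.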

Computer product instructions for cancer classification using non-aligned Kmer feature vectors

A computer product comprising a non-transitory computer readable medium storing a plurality of instructions that, when executed on one or more processors, perform the following: sequencing nucleic acids of a human subject to obtain a set of sequence reads corresponding to a plurality of chromosomes, storing definitions of a set of features including non-aligned variables comprising statistical measures of occurrence of Kmers of a Kmer database in the set of sequence reads, analyzing the set of sequence reads to obtain a feature vector comprising values of the set of features, operating on the feature vector using parameters of the machine learning model to obtain an output label comprising a classification of a cancer for the sample, and providing the output label.

Collectively, the independent claims define training and implementing a machine learning model for cancer classification using a feature vector derived from sequencing reads with non-aligned variables based on statistical measures of Kmers from a Kmer database, and a computer product that executes instructions to perform the same feature-extraction and classification steps and output a cancer classification label.

Stated Advantages

Not explicitly described in patent.

Documented Applications

Not explicitly described in patent.
