Generating machine learning models using genetic data

Inventors

Otte, Gabriel; Roberts, Charles; Drake, Adam; Ennis, Riley

Assignees

Freenome Holdings Inc

Publication Number

US-12242943-B2

Patent

Publication Date

2025-03-04

Abstract

Systems, methods, and apparatuses for generating and using machine learning models using genetic data. A set of input features for training the machine learning model can be identified and used to train the model based on training samples, e.g., for which one or more labels are known. As examples, the input features can include aligned variables (e.g., derived from sequences aligned to a population level or individual references) and/or non-aligned variables (e.g., sequence content). The features can be classified into different groups based on the underlying genetic data or intermediate values resulting from a processing of the underlying genetic data. Features can be selected from a feature space for creating a feature vector for training a model. The selection and creation of feature vectors can be performed iteratively to train many models as part of a search for optimal features and an optimal model.

Core Innovation

The disclosed invention creates and implements a machine learning model for detection of cancer in a biological sample using genetic data from nucleic acids. It receives a plurality of training samples from corresponding human subjects with cancer, sequences nucleic acids to obtain sets of sequence reads corresponding to chromosomes, and obtains known labels for cancer detection for each subject. The model is trained by receiving the sequence reads and known labels at a computer system and iteratively searching for optimal model parameters based on comparing output labels to known labels.

A core aspect of the invention is the definition of a set of features to be input to the machine learning model for each training sample. The set of features includes non-aligned variables that include statistical measures of occurrence of Kmers of a Kmer database in the set of sequence reads. From the analyzed sequence reads, the invention obtains a training vector or a feature vector in which each element corresponds to a feature that includes one or more variables, and then operates on the vector using parameters of the machine learning model to obtain output labels for detection of cancer.
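
As an illustration of the non-aligned Kmer features described above, the following minimal sketch counts Kmer occurrences directly in raw reads, with no alignment step, and normalizes the counts into a feature vector. The function names, the choice of k = 2, the use of every possible Kmer as the "Kmer database", and relative frequency as the statistical measure are all assumptions made for this example; the patent summary does not specify them.

```python
from collections import Counter
from itertools import product


def kmer_database(k, alphabet="ACGT"):
    """Hypothetical stand-in for a Kmer database: all k-mers in a fixed order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_feature_vector(reads, kmers):
    """Return the relative frequency of each database k-mer across all reads.

    Relative frequency is one assumed choice of 'statistical measure of
    occurrence'; k-mers not in the database (e.g. containing N) are ignored.
    """
    k = len(kmers[0])
    counts = Counter()
    for read in reads:
        # Slide a window of length k along the read; no reference is used.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


# Toy reads, invented for illustration only.
reads = ["ACGTAC", "GTACGT"]
db = kmer_database(2)
vec = kmer_feature_vector(reads, db)
```

Each element of `vec` corresponds to one Kmer in `db`, so the vector has a fixed length and ordering regardless of how many reads a sample contains.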

The invention also encompasses feature definitions and representations that may extend beyond non-aligned Kmer-occurrence measures, including aligned variables tied to windows of a human reference genome and sequence similarity between aligned sequence reads and each window. Feature representations can include a feature vector comprising values of the set of features for the set of sequence reads, where each element corresponds to a feature including one or more variables. The computer-readable medium stores instructions that perform the sequencing, feature-vector generation, model operation, and providing an output label indicating detection of cancer.
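
The aligned, window-based variables mentioned above might look roughly like the sketch below. It assumes reads have already been aligned (each given as a start position plus sequence on a single toy reference string), assigns each read to the window containing its start position, and uses the fraction of matching bases as the similarity measure. These simplifications, and all names in the code, are assumptions for illustration rather than details from the patent.

```python
def window_features(alignments, reference, window_size):
    """Per-window read count and mean read-to-reference identity.

    alignments: list of (start, read_seq) pairs with 0-based start positions.
    A read is assigned to the window containing its start (a simplification).
    """
    n_windows = (len(reference) + window_size - 1) // window_size
    counts = [0] * n_windows
    identity_sums = [0.0] * n_windows
    for start, read in alignments:
        ref_segment = reference[start:start + len(read)]
        if not read or not ref_segment:
            continue  # skip empty reads or reads starting past the reference
        # Fraction of read bases that match the reference at the aligned position.
        matches = sum(a == b for a, b in zip(read, ref_segment))
        identity = matches / len(read)
        w = start // window_size
        counts[w] += 1
        identity_sums[w] += identity
    mean_identity = [s / c if c else 0.0 for s, c in zip(identity_sums, counts)]
    return counts, mean_identity


# Toy reference and alignments, invented for illustration only.
reference = "ACGTACGTAC"
alignments = [(0, "ACGT"), (1, "CGTT"), (6, "GTAC")]
counts, mean_identity = window_features(alignments, reference, window_size=5)
```

Here the two windows yield one count and one mean-identity value each, which could be appended to the Kmer-based elements of the same feature vector.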

Claims Coverage

The independent claims cover three main aspects: training-time creation of a cancer-detection machine learning model using non-aligned Kmer database occurrence statistics as features; implementing the trained model to output a cancer detection label from sequence reads using the same feature-vector construct; and providing the method as a computer product (non-transitory computer readable medium) that performs the machine-learning implementation workflow.

Iterative training of cancer-detection model from labeled sequence-read features

Receiving training samples from human subjects having cancer; sequencing nucleic acids to obtain chromosome-associated sets of sequence reads; obtaining known labels for detection of cancer; identifying a set of features including non-aligned variables with statistical measures of occurrence of Kmers of a Kmer database in the sequence reads; analyzing the sequence reads to obtain training vectors; operating on the training vectors to obtain output labels; iteratively searching for optimal parameters by comparing the output labels to the known labels; and providing the parameters and the set of features.
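
The iterative parameter search in this claim can be pictured with a small sketch. The example below uses logistic regression trained by stochastic gradient descent, which is an assumed model choice; the summary does not say which model type or optimizer the patent uses. The training vectors, labels, learning rate, and epoch count are invented toy values.

```python
import math


def predict_proba(weights, bias, x):
    """Operate on a feature vector with the model parameters."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(vectors, labels, lr=0.5, epochs=500):
    """Iteratively adjust parameters by comparing outputs to known labels."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            # Difference between the model output and the known label.
            error = predict_proba(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy training vectors (e.g. two Kmer frequencies) and known labels (1 = cancer).
train_vectors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
known_labels = [1, 1, 0, 0]

weights, bias = train(train_vectors, known_labels)
output_labels = [int(predict_proba(weights, bias, x) >= 0.5) for x in train_vectors]
```

The returned `weights` and `bias`, together with the feature definitions used to build the vectors, play the role of "the parameters and the set of features" that the claim provides as output.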

Outputting a cancer detection label from non-aligned Kmer feature vectors

Sequencing nucleic acids of a human subject to obtain chromosome-associated sequence reads, storing feature definitions including non-aligned variables with statistical measures of Kmer occurrence in a Kmer database, analyzing the sequence reads to obtain a feature vector with values corresponding to the defined features, operating on the feature vector using machine learning model parameters to obtain an output label comprising detection of cancer, and providing the output label.
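
At implementation time, the stored feature definitions determine which values go into the feature vector and in what order, and the stored parameters turn that vector into an output label. The sketch below assumes a logistic model and a JSON artifact; the field names, feature names, weights, and threshold are hypothetical values chosen for the example, not taken from the patent.

```python
import json
import math

# Hypothetical stored artifact: feature definitions plus trained parameters.
model_artifact = json.loads("""
{
  "features": ["freq_AC", "freq_GT"],
  "weights": [2.0, -1.5],
  "bias": -0.1,
  "threshold": 0.5
}
""")


def detect(feature_values, artifact):
    """Build the feature vector in the stored order and produce an output label."""
    x = [feature_values[name] for name in artifact["features"]]
    z = artifact["bias"] + sum(w * xi for w, xi in zip(artifact["weights"], x))
    score = 1.0 / (1.0 + math.exp(-z))
    label = "cancer detected" if score >= artifact["threshold"] else "cancer not detected"
    return {"score": score, "label": label}


# Feature values computed from a new sample's reads (toy numbers).
sample_features = {"freq_AC": 0.3, "freq_GT": 0.3, "freq_TA": 0.2}
result = detect(sample_features, model_artifact)
```

Keeping the feature list alongside the parameters means values not named in the definitions (such as `freq_TA` here) are ignored, so the vector seen at implementation time matches the one used in training.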

Computer product implementing cancer-detection machine learning model workflow

A non-transitory computer readable medium storing instructions that, when executed, perform sequencing of nucleic acids to obtain chromosome-associated sequence reads, storing feature definitions including non-aligned Kmer-occurrence variables, analyzing sequence reads to obtain a feature vector, operating on the feature vector using machine learning model parameters to obtain an output label comprising detection of cancer, and providing the output label.

Across the independent claims, the invention is directed to sequencing-derived feature vectors for cancer detection, specifically including non-aligned variables defined using statistical measures of Kmer occurrence from a Kmer database, and using machine learning model parameters to produce output labels for detection of cancer, either during iterative training, during implementation, or via a computer product that performs the implementation workflow.

Stated Advantages

Not explicitly described in patent.

Documented Applications

Not explicitly described in patent.
