Systems and methods for rapid gene set enrichment analysis

Inventors

Koytiger, Grigoriy

Assignees

Immuneering Corp

Publication Number

US-11043305-B1

Publication Date

2021-06-22

Abstract

Systems and methods for rapid gene set enrichment analysis and applications thereof are described. In certain embodiments, the systems and methods described herein may be used to identify one or more candidate therapies for treatment of a disease (e.g., cancer). These systems and methods enable improved prioritization of relevant gene sets while maintaining a relatively lower false positive rate. Additionally, the ability to accelerate enrichment analysis and analyze hundreds of thousands of gene sets is described.

Core Innovation

The invention provides Cosiner systems and methods for rapid gene set enrichment analysis. These use vector space models and sparse matrix linear algebra to compute the cosine similarity between a disease or query gene signature and therapy or drug gene signatures. Each gene set is represented as a numeric vector, and similarity is determined from the angle between the vectors. The approach includes direct enrichment based on the cosine similarity between signatures and enables enrichment scoring at scale.
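
As a minimal sketch of the angle-based similarity described above, the following Python computes the cosine between two gene signatures stored as sparse dictionaries. The function name `cosine_similarity`, the gene symbols, and the weights are illustrative assumptions and are not taken from the patent.

```python
import math

def cosine_similarity(sig_a, sig_b):
    """Cosine of the angle between two sparse gene signatures.

    Each signature maps a gene symbol to a numeric weight (for example a
    signed differential-expression score); absent genes count as zero.
    """
    # Only genes present in both signatures contribute to the dot product.
    shared = sig_a.keys() & sig_b.keys()
    dot = sum(sig_a[g] * sig_b[g] for g in shared)
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty signature has no defined direction
    return dot / (norm_a * norm_b)

# Hypothetical toy signatures (values invented for illustration).
disease = {"ERBB2": 2.0, "GRB7": 1.0, "ESR1": -1.0}
drug = {"ERBB2": -2.0, "GRB7": -1.0, "MKI67": -1.0}
print(cosine_similarity(disease, drug))  # negative: the signatures point in opposing directions
```

A value near 1 means the two signatures point the same way, a value near 0 means they share little, and a value near -1 means one largely opposes the other.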

The invention further includes enrichment by per-gene partial cosine contributions, including weighted variants such as tf-idf and PPMI weighting. These per-gene enrichment statistics support weighting overlap genes by their partial cosines. The framework is described as enabling identification of candidate therapies and drug mechanisms by linking disease-indicative gene signatures with therapy-associated signatures.
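
To illustrate the tf-idf weighting mentioned above, here is a hedged sketch that treats each gene set as a "document" and each gene as a "term". It uses one common tf-idf formulation (binary term frequency, idf = log(N / df)); the patent's exact weighting may differ, and the function name `tfidf_weight` and the toy library are assumptions.

```python
import math
from collections import Counter

def tfidf_weight(gene_sets):
    """Turn unweighted gene sets into tf-idf weighted sparse vectors.

    gene_sets maps a set name to a collection of gene symbols. Term
    frequency is binary here, so a gene's weight is just its inverse
    document frequency, log(N / df).
    """
    n_sets = len(gene_sets)
    # Document frequency: in how many gene sets does each gene appear?
    doc_freq = Counter(g for genes in gene_sets.values() for g in set(genes))
    return {
        name: {g: math.log(n_sets / doc_freq[g]) for g in set(genes)}
        for name, genes in gene_sets.items()
    }

# Hypothetical toy library of gene sets.
library = {
    "set_a": ["ERBB2", "GRB7", "TP53"],
    "set_b": ["ERBB2", "TP53"],
    "set_c": ["TP53", "MKI67"],
}
weights = tfidf_weight(library)
```

Under this weighting, a gene that appears in every set (here `TP53`) receives weight zero, while a gene unique to one set (here `GRB7`) receives the largest weight, so ubiquitous genes contribute less to a later similarity score.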

The disclosed approach is applied to multiple biological pathway and drug-related use cases, including prioritizing KEGG pathways and identifying HER2+ breast cancer candidate therapies and mechanisms using the LINCS, TCGA, ARCHS4, NCI Nature, and WikiPathways resources. The document also describes screening GSK3 inhibitors for metastatic phenotype reversal, validated experimentally with invasion assays across multiple cell lines, and presents representative computing and cloud environments for implementation.

Claims Coverage

The document includes two independent claims, a system claim and a method claim, that share the same core pipeline: represent a disease gene set and therapy gene sets as weighted numeric vectors, compute a similarity based on an angle between vectors, and select candidate therapies when similarity is less than a threshold. Dependent claim features further refine similarity computation and downstream enrichment statistics and interpretation.

Representing disease gene set as a first numeric vector of weighted differentially expressed genes

Identify a first gene set comprising differentially expressed genes of cells indicative of the disease as compared to cells not indicative of the disease, wherein the first gene set is represented as a first numeric vector comprising weighted values corresponding to gene expression data of the differentially expressed genes of the first gene set.

Representing therapy gene sets as second numeric vectors of weighted differentially expressed genes

For each of a plurality of therapies, identify a second gene set corresponding to that therapy, wherein the second gene set is represented as a second numeric vector comprising weighted values corresponding to gene expression data of differentially expressed genes of cells treated with the therapy as compared to cells not treated with the therapy.

Computing similarity as a function of an angle between numeric vectors

Determine a measure of similarity between the first gene set and the second gene set using the first numeric vector and the second numeric vector, wherein the measure of similarity is a function of an angle between the first numeric vector and the second numeric vector.

Selecting candidate therapies when similarity is less than a threshold indicative of signature reversal

Identify one or more members of the plurality of therapies that are candidates for treatment of the disease based on the measures of similarity being less than a threshold value, which is indicative that a molecular signature of the candidate therapy reverses a molecular signature of the disease.
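
The selection step can be sketched as a simple filter over precomputed similarity values. The threshold of -0.5, the therapy names, and the scores below are hypothetical placeholders; the source does not state a specific threshold value.

```python
def select_reversal_candidates(similarities, threshold=-0.5):
    """Return (therapy, similarity) pairs below the threshold, most negative first."""
    hits = [(name, s) for name, s in similarities.items() if s < threshold]
    # Numerical ordering: the strongest apparent reversal comes first.
    return sorted(hits, key=lambda pair: pair[1])

# Hypothetical similarity of each therapy signature to a disease signature.
scores = {"drug_a": -0.83, "drug_b": 0.10, "drug_c": -0.55, "drug_d": 0.72}
candidates = select_reversal_candidates(scores)
print(candidates)  # [('drug_a', -0.83), ('drug_c', -0.55)]
```

Therapies with positive or near-zero similarity are excluded because their signatures do not oppose the disease signature.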

Similarity defined using cosine similarity between numeric vectors

The measure of similarity is defined as a cosine similarity between the first numeric vector and the second numeric vector.

Similarity computation using sparse matrix linear algebra

The measure of similarity is computed using sparse matrix linear algebra.
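
The idea behind sparse scoring is that only nonzero entries are stored and touched. The sketch below is a pure-Python stand-in for a sparse matrix-vector product, using a gene-to-entries index; a production system would more likely rely on a sparse matrix library. The names `build_index` and `score_all` and the toy data are assumptions, not the patent's implementation.

```python
import math
from collections import defaultdict

def build_index(library):
    """Store the gene-set matrix column-wise: gene -> [(set_name, unit_weight)]."""
    index = defaultdict(list)
    for name, vec in library.items():
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm == 0.0:
            continue  # skip empty sets
        for gene, w in vec.items():
            index[gene].append((name, w / norm))  # pre-normalize each row
    return index

def score_all(query, index):
    """Cosine similarity of one query against every indexed set.

    Work is proportional to the number of stored nonzero entries that
    share a gene with the query. Sets with no shared gene are omitted
    from the result (their similarity is implicitly zero).
    """
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    scores = defaultdict(float)
    if q_norm == 0.0:
        return dict(scores)
    for gene, q_w in query.items():
        for name, w in index.get(gene, ()):
            scores[name] += (q_w / q_norm) * w
    return dict(scores)

# Hypothetical therapy signatures and disease query.
library = {
    "drug_a": {"ERBB2": -2.0, "GRB7": -1.0, "MKI67": -1.0},
    "drug_b": {"ESR1": 1.0},
    "drug_c": {"TP53": 1.0},
}
query = {"ERBB2": 2.0, "GRB7": 1.0, "ESR1": -1.0}
scores = score_all(query, build_index(library))
```

Because `drug_c` shares no gene with the query, it is never visited, which is the property that lets this style of computation scale to very large gene set libraries.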

Computing per-gene enrichment statistics

Compute per-gene enrichment statistics and use them to identify a drug mechanism.

Weighting overlap genes by partial cosines

Weight the overlap genes by their partial cosines.
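
One plausible reading of a "partial cosine" is the single-gene term of the normalized dot product, so that the terms over all overlap genes sum to the cosine similarity. The sketch below works under that assumption; the function name `partial_cosines` and the toy signatures are hypothetical.

```python
import math

def partial_cosines(sig_a, sig_b):
    """Per-gene contributions a_g * b_g / (|a| * |b|) for overlap genes.

    Under this definition the contributions sum to the cosine similarity.
    """
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return {}
    overlap = sig_a.keys() & sig_b.keys()
    return {g: sig_a[g] * sig_b[g] / (norm_a * norm_b) for g in overlap}

# Hypothetical toy signatures (values invented for illustration).
disease = {"ERBB2": 2.0, "GRB7": 1.0, "ESR1": -1.0}
drug = {"ERBB2": -2.0, "GRB7": -1.0, "MKI67": -1.0}

contributions = partial_cosines(disease, drug)
# Rank overlap genes by how strongly they drive the overall score.
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(ranked[0][0])  # ERBB2
```

Ranking overlap genes this way highlights which genes account for most of the similarity, which is the kind of information that could point toward a drug mechanism.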

Ranking by numerical ordering of similarity values

Rank the similarity measures using a numerical ordering of the similarity measure values.

Overall, the claims center on converting disease and therapy gene sets into weighted numeric vectors, computing a similarity that is a function of the angle between those vectors, and selecting therapies whose similarity falls below a threshold, which indicates reversal of the disease's molecular signature. Dependent features further constrain the similarity computation (cosine similarity, sparse matrix linear algebra) and add per-gene enrichment statistics to support drug mechanism identification, including weighting overlap genes by their partial cosines and ranking results by numerical ordering of the similarity values.

Stated Advantages

Faster runtimes

Scaling to hundreds of thousands of gene sets

Lower false positive rates via permutation testing

Documented Applications

Prioritizing KEGG pathways

Identifying HER2+ breast cancer candidate therapies and mechanisms, including HER2/ErbB2 enrichment, using the LINCS, TCGA, ARCHS4, NCI Nature, and WikiPathways resources

Screening GSK3 inhibitors for metastatic phenotype reversal with experimental invasion assay validation across multiple cell lines
