Systems and methods for rapid gene set enrichment analysis
Abstract
Systems and methods for rapid gene set enrichment analysis and applications thereof are described. In certain embodiments, the systems and methods described herein may be used to identify one or more candidate therapies for treatment of a disease (e.g., cancer). These systems and methods enable improved prioritization of relevant gene sets while maintaining a relatively low false positive rate. The document also describes the ability to accelerate enrichment analysis and to analyze hundreds of thousands of gene sets.
Core Innovation
The invention provides Cosiner systems and methods for rapid gene set enrichment analysis. They use vector space models and sparse matrix linear algebra to compute the cosine similarity between a disease (query) gene signature and therapy (drug) gene signatures. Each gene set is represented as a numeric vector, and similarity is determined from the angle between two vectors. The approach includes direct enrichment scoring based on the cosine similarity (overlap) between signatures, which can be computed at scale.
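The vector space scoring described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes NumPy and SciPy are available, and the function name `cosine_scores`, the four-gene layout, and all weights are hypothetical toy values.

```python
import numpy as np
from scipy import sparse


def cosine_scores(query, signatures):
    """Cosine similarity between one query vector and each row of a sparse matrix.

    query:      1 x n_genes row vector (hypothetical disease signature)
    signatures: n_sets x n_genes matrix (hypothetical therapy signatures)
    """
    query = sparse.csr_matrix(query, dtype=float)
    signatures = sparse.csr_matrix(signatures, dtype=float)

    # Dot product of every signature with the query in a single sparse multiply.
    dots = (signatures @ query.T).toarray().ravel()

    # Euclidean norms of each signature row and of the query.
    row_norms = np.sqrt(np.asarray(signatures.multiply(signatures).sum(axis=1)).ravel())
    query_norm = np.sqrt(query.multiply(query).sum())

    denom = row_norms * query_norm
    # All-zero signatures get a similarity of 0 instead of NaN.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)


# Toy data (assumption): 4 genes, 4 therapy signatures.
query = sparse.csr_matrix([[1.0, -1.0, 0.0, 0.0]])
signatures = sparse.csr_matrix([
    [1.0, -1.0, 0.0, 0.0],   # same direction as the query
    [-1.0, 1.0, 0.0, 0.0],   # opposite direction (signature reversal)
    [0.0, 0.0, 1.0, 0.0],    # no shared genes
    [0.0, 0.0, 0.0, 0.0],    # empty signature
])
scores = cosine_scores(query, signatures)
```

Because only nonzero entries are stored and multiplied, the cost of one query grows with the number of nonzero gene weights rather than with the full gene-by-set matrix size, which is one plausible reason a sparse formulation scales to very large collections of gene sets.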
The invention further includes enrichment by per-gene partial cosine contributions, including weighted variants such as tf-idf (term frequency-inverse document frequency) and PPMI (positive pointwise mutual information) weighting. These per-gene enrichment statistics allow overlap genes to be weighted by their partial cosines. The framework is described as enabling identification of candidate therapies and drug mechanisms by linking disease-indicative gene signatures with therapy-associated signatures.
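One way to read "per-gene partial cosine contributions" is as the individual terms of the cosine numerator, each divided by the product of the two vector norms, so that the contributions sum to the overall cosine similarity. The sketch below follows that reading, which is an assumption rather than the patent's stated definition. The helper names, gene weights, and the idf form `log(N / df)` are illustrative choices.

```python
import math


def partial_cosines(a, b):
    """Per-gene contributions to the cosine similarity of two signatures.

    a, b: dicts mapping gene symbol -> weight (hypothetical inputs).
    Returns {gene: contribution} for genes present in both signatures;
    the contributions sum to the cosine similarity.
    """
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return {}
    return {g: a[g] * b[g] / (norm_a * norm_b) for g in a.keys() & b.keys()}


def idf_weights(gene_sets):
    """Inverse document frequency per gene, treating each gene set as a document."""
    n = len(gene_sets)
    df = {}
    for genes in gene_sets:
        for g in set(genes):
            df[g] = df.get(g, 0) + 1
    return {g: math.log(n / count) for g, count in df.items()}


# Toy signatures (assumption): weights are made up for illustration only.
disease = {"ERBB2": 2.0, "GRB7": 1.0, "ESR1": -2.0}
drug = {"ERBB2": -2.0, "GRB7": -1.0, "MKI67": -2.0}

contributions = partial_cosines(disease, drug)
cosine = sum(contributions.values())
# Genes sorted from the most negative contribution upward.
ranked_genes = sorted(contributions, key=contributions.get)

idf = idf_weights([{"ERBB2", "GRB7"}, {"ERBB2"}])
```

Sorting the overlap genes by contribution shows which genes drive a strongly negative (reversing) or strongly positive score, which is the kind of evidence a mechanism hypothesis could start from. The idf helper shows why weighting can matter: a gene present in every gene set receives a weight of zero and cannot dominate the score.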
The disclosed approach is applied to multiple biological pathway and drug-related use cases, including prioritizing KEGG pathways and identifying HER2+ breast cancer candidate therapies and mechanisms using LINCS/TCGA/ARCHS4/NCI Nature and WikiPathways. The document also describes screening GSK3 inhibitors for metastatic phenotype reversal with experimental invasion assay validation across multiple cell lines, and it presents representative computing and cloud environments for implementation.
Claims Coverage
The document includes two independent claims, a system claim and a method claim, that share the same core pipeline: represent a disease gene set and therapy gene sets as weighted numeric vectors, compute a similarity based on an angle between vectors, and select candidate therapies when similarity is less than a threshold. Dependent claim features further refine similarity computation and downstream enrichment statistics and interpretation.
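For reference, the angle-based similarity in its cosine form can be written out explicitly. The symbols are assumptions introduced here for illustration: d is the disease vector, t is a therapy vector, g indexes genes, and tau is the selection threshold.

```latex
\cos\theta(\mathbf{d}, \mathbf{t})
  = \frac{\mathbf{d} \cdot \mathbf{t}}{\lVert \mathbf{d} \rVert \, \lVert \mathbf{t} \rVert}
  = \frac{\sum_{g} d_g \, t_g}{\sqrt{\sum_{g} d_g^{2}} \; \sqrt{\sum_{g} t_g^{2}}},
\qquad
\text{select therapy } \mathbf{t} \text{ if } \cos\theta(\mathbf{d}, \mathbf{t}) < \tau .
```

The value ranges from -1 to 1. A negative value means the therapy vector points away from the disease vector, which matches the claims' reading of a low similarity as signature reversal.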
Representing disease gene set as a first numeric vector of weighted differentially expressed genes
Identify a first gene set comprising differentially expressed genes of cells indicative of the disease as compared to cells not indicative of the disease, wherein the first gene set is represented as a first numeric vector comprising weighted values corresponding to gene expression data of the differentially expressed genes of the first gene set.
Representing therapy gene sets as second numeric vectors of weighted differentially expressed genes
For each of a plurality of therapies, identify a second gene set corresponding to each of the one or more candidate therapies, wherein the second gene set is represented as a second numeric vector comprising weighted values corresponding to gene expression data of differentially expressed genes of cells treated with the candidate therapy as compared to cells not treated with the candidate therapy.
Computing similarity as a function of an angle between numeric vectors
Determine a measure of similarity between the first gene set and the second gene set using the first numeric vector and the second numeric vector, wherein the measure of similarity is a function of an angle between the first numeric vector and the second numeric vector.
Selecting candidate therapies when similarity is less than a threshold indicative of signature reversal
Identify one or more members of the plurality of therapies that are candidates for treatment of the disease based on the measures of similarity being less than a threshold value, which is indicative that a molecular signature of the candidate therapy reverses a molecular signature of the disease.
Similarity defined using cosine similarity between numeric vectors
The measure of similarity is defined as a cosine similarity between the first numeric vector and the second numeric vector.
Similarity computation using sparse matrix linear algebra
The measure of similarity is computed using sparse matrix linear algebra.
Computing per-gene enrichment statistics
Compute per-gene enrichment statistics and use them to identify a drug mechanism.
Weighting overlap genes by partial cosines
Weight the overlap genes by their partial cosines.
Ranking by numerical ordering of similarity values
Rank the similarity measures using a numerical ordering of the similarity measure values.
Overall, the claims center on converting disease and therapy gene sets into weighted numeric vectors, computing a similarity that is a function of the angle between the vectors, and selecting therapies whose similarity falls below a threshold, which is taken as indicative of reversal of the disease's molecular signature. Dependent features constrain the similarity computation (cosine similarity, sparse matrix linear algebra) and add per-gene enrichment statistics to support drug mechanism identification, including weighting overlap genes by partial cosines and ranking by numerical ordering of the similarity values.
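The selection and ranking steps of the claimed pipeline can be sketched as a simple filter and sort. This is an illustrative sketch under stated assumptions: the function name `select_candidates`, the therapy names, the similarity values, and the threshold of -0.3 are all hypothetical, and the source does not specify a threshold value.

```python
def select_candidates(similarities, threshold):
    """Return (therapy, similarity) pairs below the threshold, most negative first.

    similarities: dict mapping therapy name -> similarity to the disease signature.
    threshold:    cutoff below which a therapy is treated as a candidate (assumed value).
    """
    hits = [(name, s) for name, s in similarities.items() if s < threshold]
    # Numerical ordering of the similarity values, lowest (strongest reversal) first.
    return sorted(hits, key=lambda pair: pair[1])


# Toy similarity scores (assumption), e.g. produced by a cosine computation.
similarities = {
    "therapy_A": -0.62,
    "therapy_B": 0.10,
    "therapy_C": -0.35,
    "therapy_D": 0.48,
}
candidates = select_candidates(similarities, threshold=-0.3)
```

With these toy values the function keeps therapy_A and therapy_C, in that order, and drops the two therapies whose signatures resemble the disease signature rather than oppose it.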
Stated Advantages
Faster runtimes
Scaling to hundreds of thousands of gene sets
Lower false positive rates via permutation testing
Documented Applications
Prioritizing KEGG pathways
Identifying HER2+ breast cancer candidate therapies and mechanisms, including HER2/ErbB2 enrichment, using LINCS/TCGA/ARCHS4/NCI Nature and WikiPathways
Screening GSK3 inhibitors for metastatic phenotype reversal with experimental invasion assay validation across multiple cell lines