Methods and systems for detecting sequence variants
Inventors
Assignees
Interested in licensing this patent?
MTEC can help explore whether this patent might be available for licensing for your application.
Abstract
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
Core Innovation
The invention aligns one or more sequence reads to a genomic reference graph. The genomic reference graph represents at least 1,000,000 nucleic acids and comprises nodes and edges connecting the nodes, where the nodes include a first node and one or more parent nodes of the first node. The first node represents a first nucleotide sequence stored as a first string of symbols, and the data structure stores data specifying the nodes and edges.
For aligning each particular sequence read, the invention uses a dynamic programming algorithm that determines scores for entries in a first matrix associated with the first node. The first matrix represents a comparison between the particular sequence read and the first string of symbols, and the determining includes determining whether a symbol of the particular sequence read matches a first symbol of the first string of symbols. The method further accesses a score from one or more matrices associated with the parent nodes and determines a score for an entry in the first matrix based on the match result and the accessed score.
The particular sequence read is aligned to the genomic reference graph based on the determined scores, and output indicative of results of aligning the one or more sequence reads to the genomic reference graph is generated. The method is applied to sequence reads previously obtained from a biological sample from a subject, and the reference graph represents at least part of an organism’s genome and the genetic variation of that part.
In the described implementations, downstream analysis can identify genomic variants based on alignment results, including identifying the presence of a first genomic variant. The first genomic variant may be associated with nodes in the graph, and the presence can be determined based on whether at least one sequence read aligns at least partially to a node representing the variant.
Claims Coverage
The independent claims are directed to dynamic programming alignment of sequence reads to a genomic reference graph using node-based and parent-propagated scoring, and generating output indicative of the alignment results. The independent claim set includes four inventive features covering the alignment method, a non-transitory storage medium, and a system.
Dynamic programming node-and-parent score propagation for graph alignment
Accessing a genomic reference graph data structure with nodes and edges where a first node has one or more parent nodes; aligning one or more sequence reads to the genomic reference graph using a dynamic programming algorithm that determines scores in a first matrix for the first node using match determination against a first string of symbols and a score accessed from matrices associated with the one or more parent nodes; determining a score for an entry in the first matrix based on the match result and the accessed parent-node score; aligning the particular sequence read based on the determined scores.
Alignment results output indicative of genomic reference graph alignment
Generating output indicative of results of aligning the one or more sequence reads to the genomic reference graph.
Execution on a non-transitory storage medium for dynamic programming graph alignment
Using at least one non-transitory storage medium storing processor-executable instructions that, when executed by at least one processor, cause the processor to perform accessing of the genomic reference graph data structure, dynamic programming node-and-parent score propagation alignment of the sequence reads to the genomic reference graph, and generating output indicative of the aligning results.
A system configured for dynamic programming graph read alignment
A system comprising at least one processor and at least one non-transitory storage medium storing processor-executable instructions that, when executed, cause the processor to perform the dynamic programming node-and-parent score propagation aligning of the one or more sequence reads to the genomic reference graph and generate output indicative of the aligning results.
The independent claims consistently cover dynamic programming alignment of sequence reads to a genomic reference graph using scores in a first matrix for a first node combined with scores from matrices associated with parent nodes, followed by generation of output indicative of the aligning results. Two independent claims additionally recast the same approach as a non-transitory storage medium and a system including processors and the non-transitory storage medium.
Stated Advantages
Scalability by representing and aligning to a genomic reference graph representing at least 1,000,000 nucleic acids.
Documented Applications
Aligning sequence reads previously obtained from a biological sample from a subject to a genomic reference graph representing at least part of an organism’s genome and the genetic variation of that part.
Identifying the presence of a first genomic variant based on alignment results, including determining presence based on whether at least one sequence read aligns at least partially to a node representing that variant.
Interested in licensing this patent?