Systems and methods for analyzing sequence data

Inventors

Kural, Deniz

Assignees

Seven Bridges Genomics Inc

Publication Number

US-11756652-B2

Patent

Publication Date

2023-09-12

Expiration Date


Abstract

The invention provides methods for comparing one set of genetic sequences to another without discarding any information within either set. A set of genetic sequences is represented using a directed acyclic graph (DAG) avoiding any unwarranted reduction to a linear data structure. The invention provides a way to align one sequence DAG to another to produce an alignment that can itself be stored as a DAG. DAG-to-DAG alignment is a natural choice wherever a set of genomic information consisting of more than one string needs to be compared to any non-linear reference. For example, a subpopulation DAG could be compared to a population DAG in order to compare the genetic features of that subpopulation to those of the population.

Core Innovation

The invention provides a framework for genomic comparison that represents nucleotide sequences as a directed acyclic graph (DAG) data structure rather than linear sequence strings. A plurality of nodes connected by a plurality of edges forms paths, where a node represents a first nucleotide sequence stored as a first string of one or more symbols.
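As an illustration of the data structure described above, the following is a minimal sketch, not the patented implementation. The class name `SequenceDAG`, its methods, and the node identifiers `n1` to `n4` are hypothetical names chosen for this example. Each node stores a nucleotide string, and each path through the graph spells one candidate sequence.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceDAG:
    """Toy sequence DAG (hypothetical): each node stores a nucleotide string."""
    sequences: dict = field(default_factory=dict)   # node id -> nucleotide string
    successors: dict = field(default_factory=dict)  # node id -> list of child node ids

    def add_node(self, node_id, sequence):
        self.sequences[node_id] = sequence
        self.successors.setdefault(node_id, [])

    def add_edge(self, src, dst):
        self.successors[src].append(dst)

    def paths_from(self, node_id):
        """Yield every path (list of node ids) from node_id to a sink node."""
        children = self.successors[node_id]
        if not children:
            yield [node_id]
            return
        for child in children:
            for tail in self.paths_from(child):
                yield [node_id] + tail

    def spell(self, path):
        """Concatenate the node strings along a path."""
        return "".join(self.sequences[n] for n in path)


# A single-nucleotide variant (T vs. C) between two shared flanking nodes.
dag = SequenceDAG()
dag.add_node("n1", "ACG")
dag.add_node("n2", "T")
dag.add_node("n3", "C")
dag.add_node("n4", "GGA")
for src, dst in [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4")]:
    dag.add_edge(src, dst)

paths = list(dag.paths_from("n1"))
haplotypes = [dag.spell(p) for p in paths]
```

Here the two paths share the flanking nodes and differ only at the variant node, so the graph stores both sequences without duplicating the common flanks.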

Sequence analysis is performed by obtaining a plurality of sequence reads previously obtained from one or more genetic samples and aligning at least some of the plurality of sequence reads to the DAG data structure. Based on results of the aligning, the invention determines support values for nodes in a first path of a plurality of paths, where a support value for a node is indicative of a number of sequence reads aligned to the first string.

The invention then determines a first support value for the first path from the support values of the nodes in that path, includes the first path among the identified paths when the first support value exceeds a first threshold, and outputs at least one of the identified paths. In the dependent claims, path support may be computed from the node support values using a minimum relationship, and haplotype paths may be selected for output based on thresholded support and its relationship to sample or genome counts.
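The minimum relationship and the threshold test can be sketched as follows. This is a hedged illustration under stated assumptions: the function names `path_support` and `select_paths`, the node identifiers, the support counts, and the threshold value of 5 are all invented for the example and do not come from the patent.

```python
def path_support(path, node_support):
    """Path support as the minimum node support along the path.

    Nodes with no recorded support are treated as 0 (an assumption of
    this sketch).
    """
    return min(node_support.get(node_id, 0) for node_id in path)


def select_paths(paths, node_support, threshold):
    """Keep only paths whose support strictly exceeds the threshold."""
    return [p for p in paths if path_support(p, node_support) > threshold]


# Hypothetical per-node read counts for a small graph with one variant site.
node_support = {"n1": 12, "n2": 9, "n3": 1, "n4": 11}
paths = [["n1", "n2", "n4"], ["n1", "n3", "n4"]]

selected = select_paths(paths, node_support, threshold=5)
```

With a minimum, one weakly supported node is enough to exclude a path, so the second path (limited by `n3`) is dropped even though its flanking nodes are well covered.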

Claims Coverage

The independent claims present a method, a system, and a non-transitory computer-readable storage medium that obtain sequence reads from genetic samples, align them to a DAG representing nucleotide sequences, compute node and path support values, and include a path in the output when its first path support value exceeds a first threshold. Across the independent claims, the inventive features center on DAG-based read alignment, support-value computation, and threshold-based path identification and output.

DAG-based sequence read alignment to identify supported paths

Obtaining a plurality of sequence reads previously obtained from one or more genetic samples; obtaining a directed acyclic graph (DAG) data structure with nodes representing nucleotide sequence strings; aligning at least some of the plurality of sequence reads to the DAG data structure; and identifying one or more paths based on results of the aligning.
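The alignment step can be pictured with a deliberately simplified stand-in. The sketch below uses exact substring matching against each path's spelled sequence, which is an assumption made for brevity; the patent does not specify this, and a practical aligner would tolerate mismatches and gaps. All function names and the example reads are hypothetical.

```python
def align_read_to_path(read, path, node_seqs):
    """Return the node ids overlapped by the first exact match of `read`
    in the sequence spelled by `path`, or None if there is no match."""
    spelled = "".join(node_seqs[n] for n in path)
    start = spelled.find(read)
    if start < 0:
        return None
    end = start + len(read)
    hit_nodes = []
    offset = 0
    for node_id in path:
        node_end = offset + len(node_seqs[node_id])
        # The node is hit if its interval [offset, node_end) overlaps [start, end).
        if offset < end and node_end > start:
            hit_nodes.append(node_id)
        offset = node_end
    return hit_nodes


def align_reads(reads, paths, node_seqs):
    """Align each read to the first path that contains it; skip unaligned reads."""
    alignments = []
    for read in reads:
        for path in paths:
            hit = align_read_to_path(read, path, node_seqs)
            if hit is not None:
                alignments.append(hit)
                break
    return alignments


node_seqs = {"n1": "ACG", "n2": "T", "n3": "C", "n4": "GGA"}
paths = [["n1", "n2", "n4"], ["n1", "n3", "n4"]]
reads = ["CGTG", "GCGG", "ACG", "TTTT"]

alignments = align_reads(reads, paths, node_seqs)
```

The read `TTTT` matches no path and is dropped, which mirrors the claim language that only "at least some" of the reads are aligned.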

Node support values and first path support aggregation

Determining support values for nodes in a first path based on the results of the aligning, the support values including a support value for a first node indicative of a number of sequence reads aligned to the first string; and determining a first support value for the first path based on the support values for the nodes in the first path.
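One way to read "a number of sequence reads aligned" is a per-node tally over the alignment results. The sketch below assumes each alignment is given as the list of node ids a read overlaps; that input format, the function names, and the example data are assumptions of this illustration rather than details from the patent.

```python
from collections import Counter


def count_node_support(alignments):
    """Count, for each node, how many reads aligned to it.

    A read contributes at most once per node, even if the node id
    appears more than once in its alignment.
    """
    support = Counter()
    for nodes in alignments:
        support.update(set(nodes))
    return support


def support_along_path(path, support):
    """List the node support values in path order, ready for aggregation."""
    return [support[node_id] for node_id in path]


# Hypothetical alignment results: one list of node ids per aligned read.
alignments = [["n1", "n2", "n4"], ["n1", "n3", "n4"], ["n1"]]

support = count_node_support(alignments)
first_path_values = support_along_path(["n1", "n2", "n4"], support)
```

The resulting list of per-node values is the input to whatever aggregation produces the path-level support value.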

First threshold-based inclusion and output of identified paths

Including the first path in the one or more paths when the first support value exceeds a first threshold; and outputting at least one of the one or more identified paths.

In summary, the independent claims cover aligning sequence reads to a DAG of nucleotide sequence nodes, computing node-level support values, deriving a first support value for a path from those node supports, and selecting for output the paths whose first support value exceeds a first threshold. Dependent claims further refine how the first path support is aggregated and add threshold-based logic for conditional or subset output.

Stated Advantages

Not explicitly described in patent.

Documented Applications

Not explicitly described in patent.
