Generation and use of simulated genomic data

Inventors

Foryciarz, Agata; Dean, II, Dennis A.

Assignees

Seven Bridges Genomics Inc

Publication Number

US-12119089-B2

Patent

Publication Date

2024-10-15

Expiration Date


Abstract

Embodiments of the invention utilize a graph-based approach for simulating genomic datasets from large scale populations. Genomic data may be represented as a directed acyclic graph (DAG) that incorporates individual sample data including variant type, position, and zygosity. A simulator may operate on the DAG to generate variant datasets based on probabilistic traversal of the DAG. This probabilistic traversal reflects genomic variant types associated with the subpopulation used to build the DAG, and as a result, the generated variant datasets maintain statistical fidelity to the original sample data.

Core Innovation

The invention simulates genomic data for a population exhibiting a variant frequency from input data representing nucleic-acid sequences obtained by chemical analysis of biological samples. The input data are computationally represented as a directed acyclic graph (DAG) data structure including a plurality of nodes and edges connecting the nodes, where the nodes include an origin node and a terminus node and each node corresponds to a genomic position and a variant type.

Each edge of the DAG has a weight corresponding to the occurrences, in the input data, of the nodes that the edge connects. Simulated genomic data are created by repeatedly traversing a sequence of nodes, starting from the origin node and terminating at the terminus node. The traversal is performed probabilistically in accordance with the edge weights, so the simulated data maintain the variant frequency of the population represented by the input data.
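The weighted DAG and its probabilistic traversal can be illustrated with a minimal Python sketch. This is not the patented implementation: the names `build_dag`, `traverse`, `ORIGIN` and `TERMINUS`, the sample data, and the choice to weight each edge by how often two nodes appear consecutively in a sample are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical sentinel nodes standing in for the origin and terminus nodes.
ORIGIN = ("origin", None)
TERMINUS = ("terminus", None)


def build_dag(samples):
    """Build edge weights from per-sample variant lists.

    Each sample is a list of (position, variant_type) tuples. An edge weight
    counts how many samples contain the two nodes consecutively (an assumed
    reading of "occurrences of nodes connected by edges").
    """
    weights = defaultdict(lambda: defaultdict(int))
    for variants in samples:
        # Sorting by position keeps every edge pointing "forward", so the
        # resulting graph cannot contain a cycle.
        path = [ORIGIN] + sorted(variants) + [TERMINUS]
        for src, dst in zip(path, path[1:]):
            weights[src][dst] += 1
    return weights


def traverse(weights, rng):
    """Walk from ORIGIN to TERMINUS, picking each successor with probability
    proportional to its edge weight. Returns the visited variant nodes."""
    node = ORIGIN
    path = []
    while node != TERMINUS:
        successors = list(weights[node])
        counts = [weights[node][s] for s in successors]
        node = rng.choices(successors, weights=counts, k=1)[0]
        if node != TERMINUS:
            path.append(node)
    return path


# Made-up input: three samples with a few variants each.
samples = [
    [(100, "SNP"), (250, "DEL")],
    [(100, "SNP")],
    [(250, "DEL"), (400, "INS")],
]
dag = build_dag(samples)
rng = random.Random(0)  # seeded only so the sketch is repeatable
simulated = [traverse(dag, rng) for _ in range(5)]
```

Each entry of `simulated` is one simulated set of variants. Because successors are drawn in proportion to observed edge counts, edges that are frequent in the input are followed proportionally often across many traversals.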

The simulated genomic data are stored and converted into a standard file format. The DAG data structure is updated based on a determination of whether to add new variants or to update the counts of variants already present. The update uses a hash value assigned to each node by a hash function, where the hash value is a unique value determined from the genomic position and the variant type associated with that node.
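A hash-keyed update of this kind might look like the following Python sketch. The source does not name a hash function; SHA-256 over a `position:variant_type` string is an assumption (it is collision-resistant in practice rather than strictly unique), and `node_hash`, `update_nodes` and the node table layout are hypothetical.

```python
import hashlib


def node_hash(position, variant_type):
    """Derive a node key from genomic position and variant type.

    SHA-256 is an assumed choice; the source only says the hash value is
    determined by position and variant type.
    """
    key = f"{position}:{variant_type}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


def update_nodes(nodes, variants):
    """Merge new variant observations into a hash-keyed node table.

    If a variant's hash is already present, its count is incremented;
    otherwise a new node is added. Returns the number of nodes added.
    """
    added = 0
    for position, variant_type in variants:
        h = node_hash(position, variant_type)
        if h in nodes:
            nodes[h]["count"] += 1
        else:
            nodes[h] = {
                "position": position,
                "variant_type": variant_type,
                "count": 1,
            }
            added += 1
    return added


# Made-up data: two batches of observed variants.
nodes = {}
first_added = update_nodes(nodes, [(100, "SNP"), (250, "DEL")])
second_added = update_nodes(nodes, [(100, "SNP"), (400, "INS")])
```

Keying the table by hash makes the "new variant or existing variant" decision a single dictionary lookup, regardless of how many nodes the graph already holds.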

Claims Coverage

The document includes two independent claims. The claims cover weighted DAG representation of variant-frequency input data, probabilistic traversal to generate simulated genomic data while maintaining variant frequency, and hash-based updating of the DAG with storage and conversion to a standard file format.

Weighted directed acyclic graph representation of variant-frequency input data

Computationally representing input data for a population exhibiting a variant frequency as a directed acyclic graph (DAG) data structure with nodes and edges, including an origin node and a terminus node, each node corresponding to a genomic position and a variant type, and edges having weights corresponding to occurrences of nodes connected by edges in the input data.

Probabilistic DAG traversal maintaining variant frequency

Repeatedly traversing a sequence of nodes of the DAG data structure starting from the origin node and terminating with the terminus node to create simulated genomic data while maintaining variant frequency, with traversal performed probabilistically in accordance with the edge weights.

Storing simulated genomic data and converting to standard file format

Storing the simulated genomic data and converting the simulated genomic data into a standard file format.
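The source does not say which standard file format is used. As an illustration only, the sketch below assumes VCF (Variant Call Format) and writes a minimal single-sample file. The function `to_vcf`, the record fields (`chrom`, `pos`, `ref`, `alt`, `zygosity`), the sample name and the example records are all hypothetical.

```python
def to_vcf(records, sample_name="SIM0001"):
    """Render simulated variant records as minimal VCF 4.2 text.

    Each record is a dict with chrom, pos, ref, alt and zygosity ("hom" or
    "het"). VCF is an assumed target format; the source names none.
    """
    header = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                   "FILTER", "INFO", "FORMAT", sample_name]),
    ]
    body = []
    for rec in sorted(records, key=lambda r: (r["chrom"], r["pos"])):
        # Map zygosity to a genotype call: homozygous alternate or heterozygous.
        gt = "1/1" if rec["zygosity"] == "hom" else "0/1"
        body.append("\t".join([
            rec["chrom"], str(rec["pos"]), ".", rec["ref"], rec["alt"],
            ".", ".", ".", "GT", gt,
        ]))
    return "\n".join(header + body) + "\n"


# Made-up simulated records for one synthetic individual.
records = [
    {"chrom": "chr1", "pos": 250, "ref": "AT", "alt": "A", "zygosity": "het"},
    {"chrom": "chr1", "pos": 100, "ref": "C", "alt": "T", "zygosity": "hom"},
]
vcf_text = to_vcf(records)
```

The resulting `vcf_text` can be written to disk with an ordinary file write; missing ID, QUAL, FILTER and INFO values are emitted as `.`, the VCF convention for missing data.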

Hash-based update of the DAG using genomic position and variant type

Updating the DAG data structure based on a determination of whether to add new variants or to update the counts of variants already present, using a hash value assigned to each node by a hash function, the hash value being a unique value determined from the genomic position and the variant type associated with each node.

Overall, the independent claims focus on weighted DAG representation, probabilistic traversal for simulated genomic data while maintaining variant frequency, and hash-based updating with storage and conversion to a standard file format.

Stated Advantages

Documented Applications

Benchmarking and algorithm evaluation using simulated genomic data from multi-population and subpopulation simulations.
