Generation and use of simulated genomic data
Inventors
Foryciarz, Agata • Dean, II, Dennis A.
Assignees
Abstract
Embodiments of the invention utilize a graph-based approach for simulating genomic datasets from large scale populations. Genomic data may be represented as a directed acyclic graph (DAG) that incorporates individual sample data including variant type, position, and zygosity. A simulator may operate on the DAG to generate variant datasets based on probabilistic traversal of the DAG. This probabilistic traversal reflects genomic variant types associated with the subpopulation used to build the DAG, and as a result, the generated variant datasets maintain statistical fidelity to the original sample data.
Core Innovation
The invention simulates genomic data for a population exhibiting a variant frequency from input data representing nucleic-acid sequences obtained by chemical analysis of biological samples. The input data are computationally represented as a directed acyclic graph (DAG) data structure including a plurality of nodes and edges connecting the nodes, where the nodes include an origin node and a terminus node and each node corresponds to a genomic position and a variant type.
Each edge of the DAG is weighted according to the occurrences, in the input data, of the nodes that the edge connects. Simulated genomic data are created by repeatedly traversing a sequence of nodes from the origin node to the terminus node while maintaining the variant frequency of the population in the input data; the traversal is performed probabilistically in accordance with the edge weights.
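The weighted-DAG construction and probabilistic traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each sample is assumed to be a list of (position, variant_type) pairs, edge weights are assumed to count how often two nodes appear consecutively in a sample, and the names build_dag, traverse, ORIGIN, and TERMINUS are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical sentinel nodes marking the start and end of every path.
ORIGIN = ("origin", None)
TERMINUS = ("terminus", None)


def build_dag(samples):
    """Build edge weights by counting consecutive node pairs across samples.

    Each sample is a list of (position, variant_type) tuples. Sorting by
    position keeps every edge pointing "downstream", so the graph is acyclic.
    """
    edges = defaultdict(lambda: defaultdict(int))
    for variants in samples:
        path = [ORIGIN] + sorted(variants) + [TERMINUS]
        for src, dst in zip(path, path[1:]):
            edges[src][dst] += 1
    return edges


def traverse(edges, rng):
    """Walk from ORIGIN to TERMINUS, choosing each next node with
    probability proportional to the weight of the outgoing edge."""
    node, path = ORIGIN, []
    while node != TERMINUS:
        successors = list(edges[node])
        weights = [edges[node][s] for s in successors]
        node = rng.choices(successors, weights=weights, k=1)[0]
        if node != TERMINUS:
            path.append(node)
    return path


if __name__ == "__main__":
    samples = [
        [(100, "SNP"), (250, "DEL")],
        [(100, "SNP")],
        [(250, "DEL"), (400, "INS")],
    ]
    dag = build_dag(samples)
    rng = random.Random(0)  # seeded only to make the sketch repeatable
    for _ in range(3):
        print(traverse(dag, rng))
```

In this sketch every node has equal total incoming and outgoing weight, so the expected frequency of each variant across many simulated samples matches its frequency in the input samples.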
The simulated genomic data are stored and converted into a standard file format, and the DAG data structure is updated based on a determination of whether to add new variants or update the counts of variants already present. The update uses a hash value assigned to each node by a hash function, where the hash value is a unique value determined based on the genomic position and the variant type associated with each node.
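The hash-based update can be illustrated with a short sketch. The source does not name a hash function or a storage layout, so the use of SHA-256, the dictionary of counts keyed by hash value, and the names node_key and update_nodes are all assumptions made for illustration.

```python
import hashlib


def node_key(position, variant_type):
    """Derive a deterministic key from genomic position and variant type.

    SHA-256 is an assumption; any hash with negligible collision risk
    over (position, variant_type) pairs would serve the same role.
    """
    return hashlib.sha256(f"{position}:{variant_type}".encode()).hexdigest()


def update_nodes(node_counts, variants):
    """Add variants not yet in the graph; increment counts of known ones.

    node_counts maps hash value -> occurrence count. Returns the list of
    variants that were newly added, in input order.
    """
    added = []
    for position, variant_type in variants:
        key = node_key(position, variant_type)
        if key in node_counts:
            node_counts[key] += 1  # variant already present: update count
        else:
            node_counts[key] = 1  # new variant: create its entry
            added.append((position, variant_type))
    return added


if __name__ == "__main__":
    counts = {}
    print(update_nodes(counts, [(100, "SNP"), (250, "DEL")]))
    print(update_nodes(counts, [(100, "SNP"), (400, "INS")]))
```

Keying on the hash makes the lookup that decides between adding a new variant and updating an existing count a single dictionary access.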
Claims Coverage
The document includes two independent claims. The claims cover weighted DAG representation of variant-frequency input data, probabilistic traversal to generate simulated genomic data while maintaining variant frequency, and hash-based updating of the DAG with storage and conversion to a standard file format.
Weighted directed acyclic graph representation of variant-frequency input data
Computationally representing input data for a population exhibiting a variant frequency as a directed acyclic graph (DAG) data structure with nodes and edges, including an origin node and a terminus node, each node corresponding to a genomic position and a variant type, and edges having weights corresponding to occurrences of nodes connected by edges in the input data.
Probabilistic DAG traversal maintaining variant frequency
Repeatedly traversing a sequence of nodes of the DAG data structure starting from the origin node and terminating with the terminus node to create simulated genomic data while maintaining variant frequency, with traversal performed probabilistically in accordance with the edge weights.
Storing simulated genomic data and converting to standard file format
Storing the simulated genomic data and converting the simulated genomic data into a standard file format.
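As an illustration of the conversion step, the sketch below writes simulated records as minimal VCF text. The source says only "a standard file format", so the choice of VCF is an assumption, as are the record fields (chrom, pos, ref, alt, gt) and the function name to_vcf.

```python
def to_vcf(records, sample_name):
    """Render simulated variant records as minimal VCF 4.2 text.

    Each record is a dict with hypothetical keys: chrom, pos, ref, alt,
    and gt (a genotype string such as "0/1" that carries the zygosity).
    Fields the simulation does not produce are written as "." (missing).
    """
    lines = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "\t".join(
            ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
             "INFO", "FORMAT", sample_name]
        ),
    ]
    for rec in sorted(records, key=lambda r: (r["chrom"], r["pos"])):
        lines.append(
            "\t".join(
                [rec["chrom"], str(rec["pos"]), ".", rec["ref"], rec["alt"],
                 ".", ".", ".", "GT", rec["gt"]]
            )
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    simulated = [
        {"chrom": "1", "pos": 250, "ref": "AT", "alt": "A", "gt": "1/1"},
        {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "gt": "0/1"},
    ]
    print(to_vcf(simulated, "sim_001"), end="")
```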
Hash-based update of the DAG using genomic position and variant type
Updating the DAG data structure based on a determination of whether to add new variants or update the counts of variants already present, using a hash value assigned to each node by a hash function, the hash value being a unique value determined based on the genomic position and the variant type associated with each node.
Overall, the independent claims focus on weighted DAG representation, probabilistic traversal for simulated genomic data while maintaining variant frequency, and hash-based updating with storage and conversion to a standard file format.
Stated Advantages
Documented Applications
Benchmarking and algorithm evaluation using simulated genomic data from multi-population and subpopulation simulations.