Systems and methods for facilitating rapid genome sequence analysis

Inventors

Holt, Carson Hinton; Yandell, Mark

Assignees

University of Utah; University of Utah Research Foundation Inc

Publication Number

US-12176070-B2

Publication Date

2024-12-24

Expiration Date

2042-04-19


Abstract

A method for facilitating rapid genome sequence analysis includes accessing an output stream of an alignment process that includes aligned reads of a biological sequence that are aligned to a reference genome. The method also includes distributing the aligned reads to a plurality of computing nodes based on genomic position. Each of the plurality of computing nodes is assigned to a separate data bin of a plurality of data bins associated with genomic position. The method also includes, for at least one aligned read determined to overlap separate data bins of the plurality of data bins, duplicating the at least one aligned read and distributing the at least one aligned read to separate computing nodes of the plurality of computing nodes that are assigned to the separate data bins.

Core Innovation

The invention addresses the computational challenge of rapid genome sequence analysis by introducing systems and methods that efficiently access, distribute, and process aligned reads of biological sequences. Specifically, the method accesses the output stream of an alignment process, such as the Burrows-Wheeler Aligner, which emits reads of a biological sequence aligned to a reference genome. These aligned reads are then distributed to multiple computing nodes based on their genomic positions, with each computing node assigned to a data bin associated with a particular genomic range.

A key innovation is the handling of aligned reads that overlap adjacent genomic regions (data bins); such reads are duplicated and distributed to all relevant computing nodes, ensuring comprehensive data coverage across boundary regions. This selective duplication facilitates parallel and redundant storage, supporting simultaneous and regionally independent downstream analysis, while also preventing data loss at bin boundaries. Local files generated at each node may include redundant entries at the file's start or end, reflecting the overlap between bins.
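The binning and duplication step described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the fixed bin width `BIN_SIZE`, the `AlignedRead` record, the single-chromosome coordinates, and the in-memory dictionary standing in for computing nodes are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class AlignedRead(NamedTuple):
    """Hypothetical minimal read record (one chromosome assumed)."""
    name: str
    start: int  # 0-based inclusive start on the reference
    end: int    # exclusive end on the reference


BIN_SIZE = 1000  # hypothetical bin width in bases


def bins_for_read(read, bin_size=BIN_SIZE):
    """Return every bin index that the read's span touches."""
    first = read.start // bin_size
    last = (read.end - 1) // bin_size
    return list(range(first, last + 1))


def distribute(reads, bin_size=BIN_SIZE):
    """Map bin index -> reads; the dict stands in for one node per bin."""
    nodes = defaultdict(list)
    for read in reads:
        # A read that overlaps two bins is appended to both lists,
        # which is the duplication at bin boundaries.
        for bin_index in bins_for_read(read, bin_size):
            nodes[bin_index].append(read)
    return nodes


reads = [
    AlignedRead("r1", 100, 250),    # entirely inside bin 0
    AlignedRead("r2", 950, 1100),   # straddles the bin 0 / bin 1 boundary
    AlignedRead("r3", 1500, 1650),  # entirely inside bin 1
]
nodes = distribute(reads)
print({b: [r.name for r in rs] for b, rs in sorted(nodes.items())})
```

In this sketch the boundary-spanning read `r2` ends up in the lists for both bin 0 and bin 1, so each bin can be analyzed without consulting its neighbor.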

To prevent data duplication when merging local files into a final merged file, the method identifies regions of interest within each local file by selectively decompressing only the necessary compression blocks. This process enables efficient recombination without full decompression or prior indexing, streamlining the assembly of large merged genome data outputs. The patent also describes methods for validating the merged file by comparing hash values for reads between the original and merged files, ensuring data integrity and correctness throughout the parallelized workflow.
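As a rough illustration of merging without full decompression, the following Python sketch treats each local file as a list of independently compressed gzip members. The block layout, the tab-separated `position<TAB>name` record format, the position-sorted order, and the helper names are assumptions for this example and are not taken from the patent. Only blocks at the edges of a local file are opened to trim redundant entries; interior blocks are copied byte for byte.

```python
import gzip


def make_block(records):
    """Compress (position, name) records into one standalone gzip member."""
    text = "".join(f"{pos}\t{name}\n" for pos, name in records)
    return gzip.compress(text.encode())


def read_block(block):
    """Decompress gzip data back into (position, name) records."""
    lines = gzip.decompress(block).decode().splitlines()
    return [(int(pos), name) for pos, name in (ln.split("\t") for ln in lines)]


def trim_local_file(blocks, lo, hi):
    """Keep records with lo <= position < hi, opening only edge blocks.

    Assumes records are sorted by position across the blocks.
    Returns (trimmed_blocks, number_of_blocks_decompressed).
    """
    blocks = list(blocks)
    opened = 0
    # Walk in from the front until a block holds a record at or past lo.
    i = 0
    while i < len(blocks):
        records = read_block(blocks[i])
        opened += 1
        kept = [r for r in records if r[0] >= lo]
        if kept:
            if len(kept) < len(records):
                blocks[i] = make_block(kept)  # rewrite only this edge block
            break
        i += 1
    blocks = blocks[i:]
    # Walk in from the back until a block holds a record before hi.
    j = len(blocks) - 1
    while j >= 0:
        records = read_block(blocks[j])
        opened += 1
        kept = [r for r in records if r[0] < hi]
        if kept:
            if len(kept) < len(records):
                blocks[j] = make_block(kept)
            break
        j -= 1
    return blocks[: j + 1], opened


def merge(local_files):
    """Concatenate trimmed blocks; concatenated gzip members form valid gzip data."""
    merged = []
    for blocks, lo, hi in local_files:
        trimmed, _ = trim_local_file(blocks, lo, hi)
        merged.extend(trimmed)
    return b"".join(merged)


bin0 = [make_block([(100, "r1"), (950, "r2")])]
bin1 = [
    make_block([(950, "r2"), (1010, "r4")]),   # redundant r2 at the file's start
    make_block([(1200, "r5"), (1300, "r6")]),  # interior block, never opened
    make_block([(1500, "r3"), (2050, "r8")]),  # r8 would belong to a later bin
]
merged = merge([(bin0, 0, 1000), (bin1, 1000, 2000)])
print([name for _, name in read_block(merged)])
```

The design choice here is that the merge cost scales with the number of boundary blocks rather than with the total file size, since interior blocks are passed through unchanged.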

Claims Coverage

The patent contains one independent claim, detailing three main inventive features.

Distributing aligned reads to data bins and computing nodes based on genomic position

The method distributes aligned reads from the output stream of an alignment process to a plurality of data bins based on genomic position. Each data bin is assigned to a separate computing node, such that assigning an aligned read to a bin corresponds to distributing the read to its respective computing node. This approach enables parallel processing by mapping genomic regions to distinct nodes for more efficient genome sequence analysis.

Duplicating aligned reads overlapping separate data bins

For any aligned read determined to overlap separate data bins, the system duplicates the read and distributes it to the corresponding computing nodes assigned to those data bins. This ensures that no region of interest loses data coverage at bin boundaries and supports accurate downstream analysis, even in overlapping genomic areas.

Accessing and processing the output stream of an alignment process

The invention requires accessing an output stream from an alignment process, where the output stream includes the aligned reads of a biological sequence aligned to a reference genome. This step forms the foundation for subsequent distribution and computational steps within the overall genome sequence analysis workflow described in the claim.

Collectively, these inventive features enable efficient, parallelized genome sequence analysis by distributing processing duties across computing nodes mapped to genomic positions, with data redundancy at bin overlaps to ensure completeness and correctness.

Stated Advantages

Facilitates faster genome sequence analysis than traditional methods.

Allows parallel processing and simultaneous secondary analyses by distributing data across multiple computing nodes based on genomic position.

Prevents data loss at genomic bin boundaries by duplicating overlapping reads, and prevents those duplicates from reaching the merged file by trimming redundant entries during the merge.

Enables efficient merging of regional files without the need for full decompression or indexing by locating boundaries within compression blocks.

Provides reliable validation of merged files using hash comparisons, ensuring data integrity in large-scale parallel processing workflows.
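The hash-based validation mentioned above could look roughly like the following Python sketch. The choice of SHA-256, the string representation of a read, and the function names are assumptions for illustration; the source only states that hash values for reads are compared between the original and merged files.

```python
import hashlib
from collections import Counter


def read_digest(record: str) -> str:
    """Hash one read record (here simply its text line)."""
    return hashlib.sha256(record.encode()).hexdigest()


def digest_counts(records):
    """Count how often each read digest occurs, ignoring record order."""
    return Counter(read_digest(r) for r in records)


def validate(original_records, merged_records) -> bool:
    """True when both files hold exactly the same reads, each the same number of times."""
    return digest_counts(original_records) == digest_counts(merged_records)


original = ["r1\t100", "r2\t950", "r3\t1500"]
print(validate(original, ["r3\t1500", "r1\t100", "r2\t950"]))             # reordered only
print(validate(original, ["r1\t100", "r2\t950", "r2\t950", "r3\t1500"]))  # leftover duplicate
```

Comparing digest counts rather than digest sets means that a duplicated boundary read left in the merged file is detected as well as a missing read.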

Documented Applications

Rapid analysis of genome sequences in large-scale sequencing projects.

Screening for mutations in oncology settings, including melanoma, breast cancer, and lung cancer patients.

Identifying gene lists of causal candidates for genetic disease treatment and in vivo experimentation.

Burden analysis (e.g., VAAST) and SNP/haplotype statistical analysis (e.g., GPAT) for populations of interest.
