Systems and methods for SNP analysis and genome sequencing
Inventors
Assignees
Noblis
Noblis is a nonprofit research and technical organization supporting federal missions in defense, health, environment, and security. Emphasizing applied sciences, engineering, digital transformation, artificial intelligence, cloud, and cybersecurity, Noblis provides objective solutions for government agencies confronting complex operational and scientific challenges.
Abstract
In some embodiments, techniques for identifying one or more species in an undifferentiated environmental sample comprising a plurality of nucleic acid sequences are provided. One or more indices that represent a plurality of reference nucleic acid sequences may be provided, and data may be received comprising digital representations of respective nucleic acid sequences. The respective nucleic acid sequences may be aligned, if possible, using the indices. A respective alignment ratio may be calculated for each one of the reference nucleic acid sequences, based on the number of nucleic acid sequences aligned to the respective reference nucleic acid sequence and the total number of nucleic acid sequences in the received data.
Core Innovation
The invention provides a system and method for aligning data representing a nucleic acid sequence by generating an indexed alignment framework. A first subsequence is retrieved from a first position of a first nucleic acid sequence and a hash of the first subsequence is computed to determine a corresponding element of an index. The index comprises elements corresponding to potential permutations of the first nucleic acid sequence, and the elements are limited based on statistical methods regarding which permutations are most likely to occur to less than a total possible number of permutations.
In the generated index, the corresponding element stores position data reflecting the first position and indicating the first nucleic acid sequence. The system then receives data comprising a second nucleic acid sequence and computes a second hash of a second subsequence; like the first hash, the second hash is computed from the subsequence itself and is independent of genomic position. Using the element of the index determined by the second hash, the system obtains position data for the reference nucleic acid sequence and compares the second subsequence with the first nucleic acid sequence at the position that the position data indicates.
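The index construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `hash_subsequence`, `build_index`, and `BASE_CODE` are hypothetical, the two-bits-per-base packing is just one possible position-independent hash, and a sparse dictionary that stores only observed subsequences stands in for the statistically limited set of index elements, whose actual selection method is not detailed here.

```python
from collections import defaultdict

K = 16  # subsequence length; 16 bases is the value named in the dependent claims
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding, for illustration


def hash_subsequence(subseq):
    """Pack a subsequence into an integer, two bits per base.

    The value depends only on the bases, never on where they occur.
    Returns None if the subsequence contains a symbol other than A/C/G/T.
    """
    value = 0
    for base in subseq:
        code = BASE_CODE.get(base)
        if code is None:
            return None
        value = (value << 2) | code
    return value


def build_index(reference_id, reference, k=K):
    """Map each subsequence hash to the (reference_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        h = hash_subsequence(reference[pos:pos + k])
        if h is not None:
            index[h].append((reference_id, pos))
    return index


# Hypothetical reference sequence, made up for this example.
reference = "ACGTACGTTAGCCGATAGGCTTACGATCGATCGGA"
index = build_index("ref1", reference)

# Looking up a subsequence by its hash recovers where it sits in the reference.
probe = reference[5:5 + K]
hits = index[hash_subsequence(probe)]
```

Because each index element stores both a sequence identifier and a position, a single index built this way could hold entries from several reference sequences.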
Alignment is determined from this comparison by counting the mismatched bases and checking the count against a predetermined threshold number of bases. When the number of mismatched bases is less than the predetermined threshold, the system determines that the second subsequence is aligned with the first nucleic acid sequence. The framework is described for nucleic acid sequences, and it also includes alignment of amino-acid sequences and mismatch-based comparison against a mismatch threshold.
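The mismatch-threshold decision can be illustrated with a short sketch. The function names `count_mismatches` and `is_aligned` are hypothetical, and the default threshold of 3 is taken from the dependent claims; the strict "less than" comparison follows the wording above.

```python
MISMATCH_THRESHOLD = 3  # the dependent claims name a threshold of 3 bases


def count_mismatches(read, reference, position):
    """Count positions where `read` differs from `reference` starting at `position`.

    Returns None if the read would run past either end of the reference.
    """
    if position < 0 or position + len(read) > len(reference):
        return None
    window = reference[position:position + len(read)]
    return sum(1 for a, b in zip(read, window) if a != b)


def is_aligned(read, reference, position, threshold=MISMATCH_THRESHOLD):
    """Treat the read as aligned when fewer than `threshold` bases mismatch."""
    mismatches = count_mismatches(read, reference, position)
    return mismatches is not None and mismatches < threshold
```

Under these assumptions, a read with two mismatches at the indicated position is accepted, while a read with exactly three is rejected.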
The framework also covers two applications: detecting SNPs by comparing a consensus derived from aligned sequences against a reference, with mismatches subject to a confidence threshold, and identifying species in mixed samples by aligning subsequences to species-specific indices and computing alignment ratios.
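A consensus-based SNP check of the kind mentioned above might look like the following sketch. It assumes the reads have already been placed on the reference, and it assumes one particular confidence measure, the fraction of covering reads that agree with the consensus base; the summary does not say how confidence is actually computed. The name `call_snps` and the parameters `min_depth` and `min_confidence` are hypothetical.

```python
from collections import Counter


def call_snps(reference, alignments, min_depth=3, min_confidence=0.8):
    """Report (position, reference_base, consensus_base) where the consensus differs.

    `alignments` is an iterable of (read, start_position) pairs that have
    already been aligned to `reference`.
    """
    # One counter of observed bases per reference position.
    pileup = [Counter() for _ in reference]
    for read, start in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to form a consensus here
        consensus, support = counts.most_common(1)[0]
        if consensus != reference[pos] and support / depth >= min_confidence:
            snps.append((pos, reference[pos], consensus))
    return snps
```

With these defaults, a position is reported only when at least three reads cover it and at least 80% of them agree on a base that differs from the reference.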
Claims Coverage
Independent claim clm-00001 covers an index-based subsequence alignment system with permutation-limited hash indices independent of genomic position and mismatch-threshold alignment decisions. Dependent claims further refine the independent claim with specific constraints and shifted subsequence comparison via additional hashed index elements.
Permutation-limited hash index independent of genomic position
Generating an index for a first nucleic acid sequence by identifying a first subsequence retrieved from a first position, computing a hash of the first subsequence to determine a corresponding element of the index, where the index comprises elements corresponding to potential permutations of the first nucleic acid sequence, the elements are limited based on statistical methods regarding which permutations are most likely to occur to less than a total possible number of permutations, and the hash is a numerical representation computed based on the first subsequence and independent of genomic position of the first subsequence.
Index-stored position data for reference location
Storing, in the corresponding element of the index, position data reflecting the first position and indicating the first nucleic acid sequence.
Hash-based position lookup and mismatch-threshold alignment decision
Computing a second hash of a second subsequence independent of genomic position, determining position data for the reference nucleic acid sequence using an element of the index determined based on the computed second hash, comparing the second subsequence with the first nucleic acid sequence at the position indicated by the position data, determining whether a number of bases greater than a predetermined threshold number of bases are mismatched, and determining that the second subsequence is aligned with the first nucleic acid sequence when the number of mismatched bases is less than the predetermined threshold number of bases.
Subsequence length constraint
Configuring the system such that the first subsequence has a length of 16 bases.
Mismatch threshold value constraint
Configuring the system with a predetermined threshold of 3 bases.
Nucleic acid type constraint
Configuring the system so the first nucleic acid sequence includes at least one of DNA, cDNA, RNA, mRNA, and PNA.
One-base positional offset for corresponding positions
Configuring the system so the second position is offset by one base of the first nucleic acid sequence relative to the first position.
Shifted subsequence indexing and comparison via hash-indicated positions
Storing instructions that form a shifted subsequence from a second nucleic acid sequence, hash it to obtain an index element with position data pointing to matching regions in a first nucleic acid sequence, and then comparing the shifted subsequence with the first nucleic acid sequence at those indicated positions.
Overall, the claims cover an index-driven alignment approach that uses hashes independent of genomic position, limits index size using statistical methods about likely permutations, retrieves position data for reference alignment, and determines alignment by comparing mismatches against a predetermined threshold; dependent claims add specific quantitative and structural constraints and extend comparison to shifted subsequences using additional hashed index elements.
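The shifted-subsequence comparison can be sketched as below, as one possible reading of the claims rather than the claimed method itself. The function names are hypothetical, and for brevity the index is keyed directly by the subsequence string, so Python's built-in dictionary hashing stands in for the position-independent hash. The seed is taken at successive one-base offsets within the read until a hashed lookup yields a reference position where the whole read passes the mismatch threshold.

```python
from collections import defaultdict

K = 16                  # subsequence length named in the dependent claims
MISMATCH_THRESHOLD = 3  # threshold named in the dependent claims


def build_index(reference, k=K):
    """Map each k-base subsequence to the reference positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def align_with_shifts(read, reference, index, k=K, threshold=MISMATCH_THRESHOLD):
    """Try the seed at offset 0, then 1, 2, ... within the read.

    Returns (reference_start, mismatches) for the first placement of the
    whole read with fewer than `threshold` mismatches, or None.
    """
    for shift in range(len(read) - k + 1):
        seed = read[shift:shift + k]
        for seed_pos in index.get(seed, ()):
            start = seed_pos - shift  # where the full read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches < threshold:
                return start, mismatches
    return None


# Hypothetical data: a 20-base read copied from position 10 with one altered base.
reference = "ACGTACGTTAGCCGATAGGCTTACGATCGATCGGATTCAG"
index = build_index(reference)
read = reference[10:12] + "T" + reference[13:30]
result = align_with_shifts(read, reference, index)
```

In this example the altered base falls inside the seeds at offsets 0 through 2, so those exact lookups miss; the seed at offset 3 is clean and leads back to the read's true starting position.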
Stated Advantages
Index elements are limited based on statistical methods regarding which permutations are most likely to occur, to less than a total possible number of permutations.
Alignment determination is performed using hash lookup to locate reference positions and mismatch thresholding based on the number of mismatched bases being less than a predetermined threshold.
Documented Applications
Detecting SNPs using a consensus derived from aligned sequences with confidence-thresholded mismatches versus a reference.
Identifying species in mixed samples by aligning subsequences to species-specific indices and computing alignment ratios.
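The alignment ratio used for species identification follows directly from its definition: the number of sequences aligned to a given reference divided by the total number of sequences received. The sketch below assumes, as a simplification, that each read is assigned to at most one reference; `alignment_ratios` and the species labels are hypothetical placeholders.

```python
from collections import Counter


def alignment_ratios(assignments, reference_ids):
    """Compute aligned-read count / total-read count for each reference.

    `assignments` holds one entry per received read: the id of the reference
    it aligned to, or None when it aligned to nothing.
    """
    total = len(assignments)
    if total == 0:
        return {ref: 0.0 for ref in reference_ids}
    counts = Counter(ref for ref in assignments if ref is not None)
    return {ref: counts[ref] / total for ref in reference_ids}


# Hypothetical outcome of aligning eight reads against three species indices.
assignments = ["species_a", "species_a", None, "species_b",
               "species_a", None, None, "species_b"]
ratios = alignment_ratios(assignments, ["species_a", "species_b", "species_c"])
```

Unaligned reads still count toward the denominator, so the ratios need not sum to one; a reference whose ratio stays near zero is unlikely to be present in the sample.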
