Systems and methods for de novo assembly of nucleotide sequence reads using a modified string graph

Inventors

Blattner, Frederick R.Baldwin, Schuyler F.Durfee, Tim J.Miskoski, MadalinaNash, Daniel A.

Assignees

Dnastar Inc

Interested in licensing this patent?

MTEC can help explore whether this patent might be available for licensing for your application.

Publication Number

US-11557374-B1

Patent

Publication Date

2023-01-17

Expiration Date


Abstract

Systems and methods to automatically de novo assemble a set of unordered read sequences into one or more, larger nucleotide sequences are presented. The method involves first creating two identical sets of the reads, dividing each read in both sets into smaller sorted mer sequences and then comparing the mers for each read in set 1 to the mers from each read in set 2 to exhaustively identify overlapping segments. Overlap information is used to construct a modified assembly string graph, traversal of which produces a sorted string graph layout file consisting of all the reads ordered left to right including their approximate starting offset position. The sorted string graph layout file is then processed by a novel multiple sequence alignment system that uses mer matches between all the overlapping reads at a given position to place matching individual bases from each read into columns from which an overall consensus sequence is determined.

Core Innovation

The invention provides a layout assembly system for de novo assembly of a read sequence set. The system divides each read oligonucleotide sequence into one or more read mers and assigns a read sequence index and a read position index to each read mer. It also divides comparing oligonucleotide sequences into comparing mers and assigns a comparing sequence index and a comparing position index to each comparing mer, and generates read mer tables and comparing mer tables sorted by mer sequence.

For each read mer, the system determines whether there is a match between the read mer and a comparing mer that has not been previously compared to a read mer. Read oligonucleotide sequences are ordered with respect to each other by sorting using the matches and at least one of sequence index, orientation index, frameshift, and position index describing locations of comparing mers in the corresponding comparing oligonucleotide sequences. Each match is validated by minimum nucleotide length, overhanging nucleotide constraints, and mismatch thresholds.

Validated matches are classified as proper overlaps or containments based on minimum length criteria and mismatches within and on either side of the inexact matching region. The validated matches form a modified string graph comprised of vertices and edges, where each sequence read is a vertex and each overlap is an edge with a type property and a reverse property. Traversal along vertices comprised of proper overlaps generates a sequence directly from the layout, and graph scoring, reduction, and layout export are described.

Claims Coverage

The independent claims cover four inventive features: indexed mer table generation, match determination and ordering, validated overlap and containment classification, and modified string graph layout generation.

Indexing and sorting mers with sequence and position indices

Divide each read oligonucleotide sequence into one or more read mers, assign a read sequence index and a read position index to each read mer, generate a read mer table sorted by read mer sequence; divide each comparing oligonucleotide sequence into one or more comparing mers, assign a comparing sequence index and a comparing position index to each comparing mer, and generate a comparing mer table sorted by comparing mer sequence.

Match determination with previously-uncompared comparing mers and ordering by match locations

Determine matches by comparing each read mer to a comparing mer that has not been previously compared to a read mer, and order read oligonucleotide sequences using the matches and at least one of sequence index, orientation index, frameshift, and position index.

Validated overlap and containment classification based on length, overhang, and mismatch thresholds

Validate each match using minimum nucleotide length, overhanging nucleotide limits or ratios, and mismatch percentage thresholds, and classify each validated match as a proper overlap or a containment using minimum length criteria and mismatch constraints within and on either side of the inexact matching region.

Modified string graph layout generation using proper-overlap traversal

Form a modified string graph from the sequence reads and validated overlaps, with each sequence read as a vertex denoted as a parent read or a containment read and each overlap as an edge denoted as a proper overlap or a containment with a type property and a reverse property, where traversal along vertices comprised of proper overlaps generates a sequence directly from the layout.

The claims collectively cover dividing read and comparing oligonucleotide sequences into mers with indices and sorted tables, determining and ordering matches using comparison locations, validating and classifying matches as proper overlaps or containments, and using a modified string graph for layout generation by proper-overlap traversal.

Stated Advantages

Enables de novo assembly layout by generating a sequence directly from the layout using proper-overlap traversal of a modified string graph.

Uses validated matches and classification into proper overlaps and containments to support reliable graph construction for traversal-based assembly.

Improves downstream sequence ordering and alignment by producing a sorted string graph layout and performing graph scoring and reduction.

Documented Applications

De novo assembly of a read sequence set using a modified bidirectional assembly string graph.

Examples for E. coli MG1655, Yersinia pestis 195/P, and diploid haplotypes with bubble-derived structural variant detection.

JOIN OUR MAILING LIST

Stay Connected with MTEC

Keep up with active and upcoming solicitations, MTEC news and other valuable information.