Systems and methods for compressing genetic sequencing data and uses thereof

Inventors

Chandak, Shubham; Tatwawadi, Kedar Shriram; Weissman, Tsachy; Ochoa, Idoia; Hernaez, Mikel

Assignees

Leland Stanford Junior University; University of Illinois System

Publication Number

US-12300358-B2

Publication Date

2025-05-13

Expiration Date

2039-08-20

Abstract

Embodiments of the invention are generally directed to compressing genetic sequencing data. In many embodiments, the genetic sequencing data is reordered and encoded based on sequence homology between individual sequencing reads within the genetic sequencing data. Several embodiments are directed to systems to compress genetic sequencing data, and some embodiments are directed to non-transitory, machine-readable media that direct a processor to compress genetic sequencing data. In further embodiments, the genetic sequencing data represents paired-end sequencing data, and several embodiments transmit the data to a remote device.

Core Innovation

The invention provides systems and methods for compressing genetic sequencing data by reordering and encoding sequencing reads based on sequence homology. The technique obtains genetic sequencing data, groups sequencing reads by their similarity to one another, and encodes them by generating a reference sequence (a contig) against which each read is recorded at a specific position. The order of the sequencing reads follows their positions along this reference, so the encoded reads form a structured set that may be ordered differently from the input.
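
The reordering and contig-based encoding described above can be illustrated with a toy example. This is a minimal sketch, not the patented procedure: it assumes exact suffix-to-prefix overlaps and a simple greedy search, and the names `overlap`, `reorder_and_encode`, `decode`, and the `min_len` threshold are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# (contig index, start position within the contig, read length)
Placement = Tuple[int, int, int]

def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Largest k >= min_len such that the last k bases of a equal the first k of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def reorder_and_encode(reads: List[str], min_len: int = 3):
    """Greedily chain reads by overlap with the previously placed read.

    Returns the new read order, the contigs built from the chained reads,
    and one placement per read (listed in the new order).
    """
    remaining = list(range(len(reads)))
    order: List[int] = []
    contigs: List[str] = []
    placements: List[Placement] = []
    while remaining:
        # Start a new contig from the first read that is still unplaced.
        cur = remaining.pop(0)
        contig, start = reads[cur], 0
        order.append(cur)
        placements.append((len(contigs), 0, len(reads[cur])))
        while remaining:
            prev = reads[order[-1]]
            best = max(remaining, key=lambda i: overlap(prev, reads[i], min_len))
            k = overlap(prev, reads[best], min_len)
            if k == 0:
                break  # nothing overlaps the last read; close this contig
            remaining.remove(best)
            start = start + len(prev) - k   # position relative to the previous read
            contig += reads[best][k:]       # extend the contig with the new bases only
            order.append(best)
            placements.append((len(contigs), start, len(reads[best])))
        contigs.append(contig)
    return order, contigs, placements

def decode(order: List[int], contigs: List[str], placements: List[Placement]) -> List[str]:
    """Rebuild the reads in their original input order."""
    reads = [""] * len(order)
    for original_index, (c, pos, n) in zip(order, placements):
        reads[original_index] = contigs[c][pos:pos + n]
    return reads

# Variable-length reads are supported because each placement stores its own length.
reads = ["GATTACAGG", "ACAGGTTC", "TTTTGATTA", "GGTTCAA"]
order, contigs, placements = reorder_and_encode(reads)
restored = decode(order, contigs, placements)
```

In this sketch each read's position comes from its overlap with the previously placed read rather than from an external reference genome. A realistic implementation would also need to tolerate mismatches and reverse-complement matches, which this toy version ignores.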

This approach addresses the issue of large volumes and significant redundancy within genetic sequencing data, such as that produced by high-throughput sequencing platforms and stored in FASTQ or FASTA formats. Existing compression tools are not able to fully exploit the redundancy in such datasets and often lack support for variable read lengths, scalability, pairing-preserving compression, and lossless recovery. The present invention overcomes these limitations by enabling effective, order-aware encoding and compression for both single-read and paired-end sequencing data, supporting variable read lengths and providing options for both order-preserving and minimally lossy compression.

The system generates multiple data streams describing characteristics of each sequencing read (such as position, quality scores, orientation, etc.), with each stream reordered to match the aligned order of reads. These streams, along with the reordered sequences, are then compressed, possibly in blocks for scalability. The compressed data can be stored or transmitted to remote devices, supporting efficient storage, transmission, and possible reanalysis as references or computational methods evolve.
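
As a rough illustration of the stream handling described above, the sketch below reorders per-read streams to match a new read order and compresses each stream separately in fixed-size blocks. It is written under stated assumptions: the general-purpose `zlib` codec stands in for whatever compressor an actual embodiment uses, and the helper names, the example streams, and the block size of 2 are hypothetical.

```python
import zlib
from typing import Dict, List

def reorder_stream(stream: List[str], order: List[int]) -> List[str]:
    """Arrange per-read records so that entry j belongs to original read order[j]."""
    return [stream[i] for i in order]

def restore_stream(reordered: List[str], order: List[int]) -> List[str]:
    """Invert reorder_stream, recovering the original input order."""
    original = [""] * len(order)
    for j, i in enumerate(order):
        original[i] = reordered[j]
    return original

def compress_blocks(records: List[str], block_size: int) -> List[bytes]:
    """Compress records in independent blocks so each block can be decoded alone."""
    blocks = []
    for s in range(0, len(records), block_size):
        payload = "\n".join(records[s:s + block_size]).encode("ascii")
        blocks.append(zlib.compress(payload, 9))
    return blocks

def decompress_blocks(blocks: List[bytes]) -> List[str]:
    records: List[str] = []
    for block in blocks:
        records.extend(zlib.decompress(block).decode("ascii").split("\n"))
    return records

# Hypothetical read order produced by a homology-based reordering step.
order = [0, 1, 3, 2]
streams: Dict[str, List[str]] = {
    "quality": ["IIIIIIIII", "FFFFFFFF", "#########", "IIIIFFF"],
    "orientation": ["+", "+", "-", "+"],
}

# Each stream is reordered the same way, then compressed on its own.
compressed = {
    name: compress_blocks(reorder_stream(values, order), block_size=2)
    for name, values in streams.items()
}

# Lossless round trip: decompress, then undo the reordering using `order`.
recovered = {
    name: restore_stream(decompress_blocks(blocks), order)
    for name, blocks in compressed.items()
}
```

Keeping the streams separate lets each one be compressed with settings suited to its statistics, and independent blocks bound the memory needed for large datasets. Restoring the original order requires storing `order` alongside the compressed data; a mode that does not preserve input order could omit it.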

Claims Coverage

The patent contains three independent claims capturing the core inventive features of the method, system, and machine-readable medium for compressing genetic sequencing data using homology-based reordering and encoding.

Homology-based reordering and encoding of sequencing reads for data compression

The method obtains genetic sequencing data containing a plurality of sequencing reads and reorders the reads based on homology among them. Reads are encoded based on their relative position within the reordered set, yielding a comprehensive sequence (contig). Each sequencing read is aligned to this sequence at its relative position, with the position determined by contiguity with previous reads rather than external reference alignment. Encoded characteristic data streams (such as quality data) are reordered to match this new order. Compression is performed independently on the reordered reads and their characteristic data streams.

System for compressing genetic sequencing data using instruction-directed homology-based reordering and encoding

The system includes a processor, memory, and instructions that direct the processor to: obtain a plurality of sequencing reads, reorder them based on homology, encode each sequencing read based on its relative position within the reordered group (using contigs), generate reordered characteristic data streams, and then compress the reordered sequencing reads separately from their characteristic streams.

Non-transitory machine-readable medium for compressing genetic sequencing data using homology-based reordering and encoding

A non-transitory machine-readable medium contains processor instructions that, when executed, cause: obtaining genetic sequencing data, performing homology-based reordering, encoding reads relative to the reordered sequence group to yield contigs, generating and reordering characteristic data streams, and compressing the reordered sequencing reads separately from their characteristic data streams.

The inventive features are centered on compressing genetic sequencing data through homology-informed reordering, contig-based encoding, production of reordered characteristic data streams, and separate compression of these elements. These methods are implemented as a process, a system, and a computer-readable medium.

Stated Advantages

Achieves higher compression ratios for genetic sequencing data compared to existing methods, compressing data to as little as 2–3% of the uncompressed size.

Supports variable length sequencing reads not handled by many prior art compressors.

Provides scalability for high coverage and large datasets using block-based compression.

Enables pairing-preserving compression, suitable for paired-end sequencing data.

Can perform lossless compression, allowing perfect reconstruction of sequencing data.

Reduces memory (RAM) and time required to compress and decompress genetic sequencing data compared to existing approaches.

Documented Applications

Compression, storage, and transmission of genetic sequencing data generated by high-throughput sequencing platforms, including single and paired-end sequencing data.

Reanalysis of stored sequencing data, such as data from experiments at specific time points, as references and analytical methods improve.

Efficient sharing of genetic sequencing data with collaborators, medical professionals, or for secure off-site backup storage.
