Merging multiple sorted lists in a distributed computing system

Inventors

Lieber-Dembo, AdarLipcon, Todd

Assignees

Cloudera Inc

Publication Number

US-11726743-B2

Publication Date

2023-08-15

Expiration Date

2040-01-28

Interested in licensing this patent?

MTEC can help explore whether this patent might be available for licensing for your application.


Abstract

A technique is described for merging multiple lists of ordinal elements such as keys into a sorted output. In an example embodiment, a merge window is defined, based on the bounds of the multiple lists of ordinal elements, that is representative of a portion of an overall element space associated with the multiple lists. Lists of elements to be sorted can be placed into one of at least two different heaps based on whether they overlap the merge window. For example, lists that overlap the merge window may be placed into an active or “hot” heap, while lists that do not overlap the merge window may be placed into a separate inactive or “cold” heap. A sorted output can then be generated by iteratively processing the active heap. As the processing of the active heap progresses, the merge window advances, and lists may move between the active and inactive heaps.

Core Innovation

The invention describes a technique for efficiently merging multiple lists of ordinal elements, such as keys, into a single sorted output in distributed computing systems. It introduces the concept of a merge window, defined based on the bounds (minimum and maximum elements) of the multiple lists. Lists are divided into at least two different heaps based on whether their data overlaps with the merge window: an active or "hot" heap for overlapping lists, and an inactive or "cold" heap for non-overlapping lists.

This approach addresses the problem of inefficiency in existing methods for merging sorted lists, particularly in distributed systems with large data sets and many lists, such as those found in distributed storage engines like Apache Kudu™. Naive ordered scan approaches suffer from high computational complexity, and existing single-heap approaches require all peeked elements to remain in memory, causing high memory usage and increased heap motion as the number of lists grows.

By separating lists into multiple heaps based on the merge window, the introduced technique reduces the amount of heap shuffling and memory required, improving processing latency and memory efficiency. Lists move dynamically between the active and inactive heaps as the merge window advances throughout the merge process, enabling a more efficient ordered scan over multiple lists distributed across nodes in a distributed computing system.

Claims Coverage

The patent includes 20 claims with multiple inventive features focused on methods, systems, and computer-readable media for performing ordered scans using multiple heaps and merge windows in distributed computing systems.

Accessing per-rowset iterators for ordered scan

Accessing multiple per-rowset iterators, each including a run of keys associated with rows of a corresponding rowset in a distributed table.

Maintaining a first heap for iterators overlapping a merge window

Maintaining a first heap consisting of one or more per-rowset iterators that overlap a merge window representing a portion of a keyspace of the per-rowset iterators.

Iteratively processing the first heap to generate a sorted output

Iteratively processing the first heap to generate an output comprising a sorted list of keys from the plurality of rowsets.

Defining merge window based on bounds of per-rowset iterators

Defining the merge window using the lower bounds and upper bounds of the runs of keys in the per-rowset iterators, including starting at the smallest lower bound and ending at the smallest applicable upper bound among the iterators.

Updating bounds and merge window during processing

Updating the lower and/or upper bound of per-rowset iterators during processing and correspondingly updating the merge window based on the updated bounds.

Maintaining a second heap based on upper bounds of iterators in first heap

Maintaining a second heap including entries based on upper bounds of iterators in the first heap, assisting in determining merges and iterator movement between heaps.

Moving iterators between heaps based on comparisons

Comparing lower bounds of iterators in a third heap (those not overlapping merge window) with the top entry of the second heap to decide movement of iterators between heaps to maintain overlap with the merge window.

Performing merge process with popping, copying, updating bounds, and reinserting

Performing a merge where the top iterator from the first heap is popped, its next non-exhausted key copied to output, bounds updated, iterator returned to the heap, and the heap reordered, repeated until merge completion.

Optimizing block copying when only one iterator in first heap

When the first heap contains only one iterator, copying all non-exhausted keys from that iterator as a block instead of row-by-row to optimize processing.

Supporting queries with conditions and use in Apache Kudu™

Applying the method in distributed computing systems, specifically in tables managed using Apache Kudu™, including receiving query requests and using the method to satisfy such queries.

The claims cover methods and systems for performing efficient ordered scans on distributed tables using multiple heaps managed via a dynamically defined merge window based on iterator bounds, including mechanisms for updating bounds, moving iterators between heaps, and optimized merge operations that improve over existing single-heap or naïve approaches.

Stated Advantages

Reduces the number of element comparisons during merge operations.

Decreases the number of elements that must remain resident in memory.

Reduces the amount of heap motion (shuffling of lists up and down within heaps), improving performance as the number of lists grows.

Improves overall processing latency and memory consumption compared to existing naïve or single-heap ordered scan approaches.

Leads to significant reductions in processing times in various scenarios including half-overlapping, non-overlapping, and overlapping keyspace inputs as well as real large data tablets.

Documented Applications

Returning a sorted list of elements to a client in response to a query distributed across multiple tablets in a distributed computing system.

Enabling fault tolerance in distributed computing systems by maintaining order in scans so that processing can resume correctly after node failures.

Increasing data storage efficiency via compaction processes that merge multiple smaller rowsets into fewer larger ordered rowsets.

Performing incremental backups by identifying changes since previous backups, including managing versions of rows sharing the same primary key.

JOIN OUR MAILING LIST

Stay Connected with MTEC

Keep up with active and upcoming solicitations, MTEC news and other valuable information.