Distinct value estimation for query planning

Inventors

BEHM, ALEXANDERMOKHTAR, MOSTAFA

Assignees

Cloudera Inc

Publication Number

US-12105712-B2

Publication Date

2024-10-01

Expiration Date

2038-04-02

Interested in licensing this patent?

MTEC can help explore whether this patent might be available for licensing for your application.


Abstract

The problem of distinct value estimation has many applications, but is particularly important in the field of database technology where such information is utilized by query planners to generate and optimize query plans. Introduced is a novel technique for estimating the number of distinct values in a given dataset without scanning all of the values in the dataset. In an example embodiment, the introduced technique includes gathering multiple intermediate probabilistic estimates based on varying samples of the dataset, 2) plotting the multiple intermediate probabilistic estimates against indications of sample size, 3) fitting a function to the plotted data points, and 4) determining an overall distinct value estimate by extrapolating the objective function to an estimated or known total number of values in the dataset.

Core Innovation

The invention introduces a novel technique for estimating the number of distinct values in a given dataset without needing to scan all values. This technique is particularly important in the field of database technology, where distinct value estimation informs query planners in generating and optimizing query plans. The technique involves gathering multiple intermediate probabilistic distinct value estimates from varying samples of the dataset, plotting these estimates against sample sizes, fitting a function to the plotted data, and extrapolating the function to the known or estimated total number of values to determine an overall distinct value estimate.

The problem being solved arises from the limitations of existing distinct value estimation methods. The naïve approach, which involves scanning each value to count distinct ones, is impractical for extremely large datasets due to resource and time constraints. Sampling-based and probabilistic approaches, while offering some improvements, face issues such as high memory requirements, lengthy processing times, or inaccuracies in the estimates. Inaccurate distinct value estimates can lead to poorly optimized query plans, significantly slowing down query executions. The introduced technique addresses these issues by combining probabilistic estimation with curve fitting and extrapolation, allowing for accuracy, memory bounding, and the ability to handle additions and deletions in data without rescanning all values.

Claims Coverage

The patent includes multiple independent claims defining inventive features related to low-latency query processing and distinct value estimation using two-dimensional extrapolation.

Two-dimensional extrapolation for distinct value estimation

The method obtains the number of distinct values based on a two-dimensional extrapolation from multiple dataset samples each having a sample size and a probabilistic distinct value estimate, enabling accurate estimation without scanning all data.

Plotting and function fitting for distinct value estimation

The method plots multiple dataset samples within a two-dimensional space (value quantity vs distinct value quantity), fits a function to these data points, and determines the total number of distinct values by evaluating the function at the dataset's total value quantity.

Use of merged artificial samples in distinct value estimation

The inventive approach includes generating artificial samples by merging probabilistic distinct value estimates from at least two other dataset samples, increasing the number of data points for more accurate function fitting without full rescanning.

Generation of optimized query plans using statistical information

The method generates an optimized query plan based on statistical information obtained from distinct value estimates and distributes plan fragments to multiple data nodes identified via block location and membership information for efficient execution.

Low-latency query engine architecture and execution

A low-latency query engine implemented on data nodes receives queries, obtains statistical data including distinct value estimates via two-dimensional extrapolation, generates optimized plans, coordinates execution across multiple nodes using massively parallel processing (MPP), and aggregates results efficiently.

The independent claims cover inventive features involving the use of two-dimensional extrapolation of probabilistic distinct value estimates from dataset samples, the use of merged artificial samples to enhance estimation accuracy, generation of optimized query plans based on these estimates, and the implementation of a low-latency query engine architecture that supports distributed query processing and efficient result aggregation.

Stated Advantages

Retains accuracy and memory bounding of probabilistic approaches while overcoming their limitations such as needing to scan all values and requiring multiple passes.

Enables estimation of distinct values with a single pass over data and without scanning all values, leading to resource and time savings.

Supports handling of additions and deletions in datasets without needing to recompute statistics from scratch, enhancing usability for dynamically changing data.

Produces sufficiently accurate distinct value estimates with bounded memory and reduced processing time and resources, representing a significant improvement for database query planning.

Provides improved accuracy in distinct value estimation by generating numerous data points through merging of probabilistic estimator buffers, enabling better function fitting and extrapolation.

Documented Applications

Used in database technology for assisting query planners to generate optimized query plans by providing statistical information such as distinct value estimates.

Implemented in low-latency query engines for distributed computing clusters, such as Hadoop clusters, to support real-time, ad hoc SQL queries over large datasets stored on multiple data nodes.

Applicable within unified Hadoop platforms to facilitate real-time access components that allow interactive SQL queries directly on unified storage layers like HDFS and HBase.

JOIN OUR MAILING LIST

Stay Connected with MTEC

Keep up with active and upcoming solicitations, MTEC news and other valuable information.