Distinct value estimation for query planning

Inventors

BEHM, ALEXANDERMOKHTAR, MOSTAFA

Assignees

Cloudera Inc

Publication Number

US-11663213-B2

Publication Date

2023-05-30

Expiration Date

2038-04-02

Interested in licensing this patent?

MTEC can help explore whether this patent might be available for licensing for your application.


Abstract

The problem of distinct value estimation has many applications, but is particularly important in the field of database technology where such information is utilized by query planners to generate and optimize query plans. Introduced is a novel technique for estimating the number of distinct values in a given dataset without scanning all of the values in the dataset. In an example embodiment, the introduced technique includes gathering multiple intermediate probabilistic estimates based on varying samples of the dataset, 2) plotting the multiple intermediate probabilistic estimates against indications of sample size, 3) fitting a function to the plotted data points, and 4) determining an overall distinct value estimate by extrapolating the objective function to an estimated or known total number of values in the dataset.

Core Innovation

The invention introduces a novel technique for estimating the number of distinct values in a dataset without requiring a full scan of all values. This technique involves gathering multiple intermediate probabilistic estimates from varying samples, plotting these estimates against sample sizes, fitting a function to these data points, and extrapolating the function to estimate the total distinct values in the dataset.

The problem being solved is that current approaches to distinct value estimation, especially in large commercial databases with possibly trillions of rows, either require scanning all data (naïve approach) which is computationally prohibitive or rely on sampling or probabilistic methods that either consume large resources, take too long, or produce inaccurate estimates. Inaccurate estimates can lead to poorly optimized query plans and slow query execution. The invention aims to provide an accurate, bounded-memory, efficient, and single-pass distinct value estimation technique that can accommodate additions and deletions of data without full recomputation.

Claims Coverage

The patent includes several independent claims covering methods, systems, and computer-readable media for distinct value estimation techniques integrated with query planning and execution.

Estimating distinct values using extrapolation within a two-dimensional space

Scanning a subset of values in a dataset to generate a value count indicator and a probabilistic estimator buffer, and then estimating the number of distinct values by extrapolating in a two-dimensional space of value quantity versus distinct value quantity from at least one data point containing the value count and an intermediate distinct value estimate.

Generating multiple data points by maintaining and merging buckets

Creating multiple buckets each having a value count indicator and probabilistic estimator buffer, generating corresponding data points for each bucket, and producing additional data points by merging two or more buckets to improve estimation accuracy.

Fitting and selecting an objective function for distinct value estimation

Fitting various objective functions to the plotted data points using curve fitting and statistical analysis to identify the function that best represents the relationship, and then extrapolating this function to the total dataset size to estimate the number of distinct values.

Integration of distinct value estimation into query planning and execution

Using the estimated number of distinct values to generate a query plan, including generating plan fragments, distributing them to data nodes in a distributed computing cluster, and executing the query plan to produce query results.

The claims cover innovative processes and systems using intermediate probabilistic estimations combined with function fitting and extrapolation for efficient and accurate distinct value estimation, enabling optimized query planning and execution in distributed database environments.

Stated Advantages

Retains accuracy and bounded memory similar to current sampling and probabilistic approaches, while not requiring scanning all values and only needing a single pass through data.

Supports handling additions and deletions of data without needing to fully recompute statistics each time, enhancing statistics maintenance.

Allows improved accuracy through generating multiple data points from a fixed number of sampled buckets and merging buffers, efficiently producing a more accurate distinct value estimate.

Significantly reduces processing time and computing resource usage compared to full scans, making the approach practical for very large datasets.

Documented Applications

The technique is particularly applied in database technology to assist query planners in generating and optimizing query execution plans based on accurate distinct value estimates.

Implementation examples describe integration with a low-latency query engine for Hadoop environments, where query planners use the distinct value estimates for distributed query planning and execution over large-scale data nodes.

JOIN OUR MAILING LIST

Stay Connected with MTEC

Keep up with active and upcoming solicitations, MTEC news and other valuable information.