Apparatus and method for utilizing pre-computed results for query processing in a distributed database

Inventors

Cameron, Douglas J.

Assignees

Cloudera Inc

Publication Number

US-11151135-B1

Publication Date

2021-10-19

Expiration Date

2036-08-05

Interested in licensing this patent?

MTEC can help explore whether this patent might be available for licensing for your application.

Abstract

A pre-computed result module computes a result prior to receiving a query. The pre-computed result module includes instructions executed by a processor to assess a pre-computation query to designate each identified database source that contributes to the answer to the pre-computation query and corresponding database source metadata. A metadata signature is computed for each identified database source to create a store of identified database sources and corresponding metadata signatures. The query is evaluated to identify accessed database sources responsive to the query. A current metadata signature for each accessed database source is compared to the metadata signatures to identify each updated database source. Re-computed results are formed for each updated database source. Pre-computed results are utilized for each database source that is not updated. A response is supplied to the query using the re-computed results and the pre-computed results.

Core Innovation

The invention is a pre-computed result module for distributed databases that computes query results before the actual queries arrive. The module executes a pre-computation query to identify the database sources that contribute to the results, then computes a metadata signature for each source and stores the sources together with their signatures. When a new query is received, the module identifies the database sources the query accesses and compares their current metadata signatures against the stored ones to detect updates. It re-computes results only for the updated sources, reuses pre-computed results for the unchanged sources, and answers the query from the combination of the two.

The problem addressed by the invention arises in distributed databases, which often partition data over multiple worker nodes and store table data in partitions such as daily partitions. Pre-computed results are used to improve query performance by avoiding repeated data scans. However, when underlying data changes, pre-computed results can become stale, leading to incorrect query answers. A full re-computation of all pre-computed results is costly, so it is desirable to re-compute only those results that have changed due to updated data. The challenge lies particularly in identifying changes in data when the data source is controlled externally or does not provide explicit change indicators.

The invention leverages file system metadata—such as file names, sizes, and update timestamps—as automatically maintained artifacts to track changes without additional database-level processing. By computing metadata signatures (checksums) over this metadata, the system can efficiently detect which partitions or input files have changed. This enables the grouping of pre-computed results based on contributing input files and the selective re-computation of only those results affected by changed files. The system thus optimizes query processing in distributed databases by reducing unnecessary recomputation and quickly identifying stale pre-computed data requiring updates.
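As a rough illustration of a metadata signature, the sketch below hashes the name, size, and update timestamp of each file in a partition. The `FileMeta` type, the file names, and the choice of SHA-256 are assumptions made for this example; the summary only says that a checksum is computed over file system metadata and does not name an algorithm.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical record of the metadata a file system keeps per file."""
    name: str
    size: int
    mtime: int  # last-update timestamp, e.g. seconds since the epoch


def metadata_signature(files):
    """Checksum over file metadata only; file contents are never read."""
    digest = hashlib.sha256()  # assumed hash; the source does not specify one
    # Sort by name so the listing order does not affect the signature.
    for f in sorted(files, key=lambda f: f.name):
        digest.update(f"{f.name}\0{f.size}\0{f.mtime}\n".encode("utf-8"))
    return digest.hexdigest()


# Two snapshots of the same partition; the second file was rewritten.
before = [
    FileMeta("part-0.parquet", 1024, 1700000000),
    FileMeta("part-1.parquet", 2048, 1700000000),
]
after = [before[0], FileMeta("part-1.parquet", 4096, 1700000500)]

changed = metadata_signature(before) != metadata_signature(after)
```

Because only metadata is hashed, the check costs one directory listing per partition rather than a scan of the data. The trade-off is that a rewrite that preserves name, size, and timestamp would go unnoticed.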

Claims Coverage

The patent contains one independent claim, which describes the system through the four features below.

Pre-computation of results based on distributed database partitions

A system with a distributed database implemented across network-connected worker machines, each hosting database partitions composed of source files and associated source metadata including file name, size, update timestamp, and partition identification.

Metadata checksum generation for input files contributing to pre-computed results

Execution of a pre-computation query to determine input files contributing to pre-computed results, grouping results by input files, generating metadata checksums from source metadata, and storing these checksums alongside the pre-computed results.

Query evaluation with metadata checksum comparison to determine result freshness

Evaluation of new queries to identify accessed data sources and comparison of current metadata checksums against stored checksums to decide if pre-computed results can be used or if re-computation is needed.

Selective re-computation and returning of results based on metadata checksum updates

Updating stored metadata checksums for changed or new input files, selective re-computation of results for updated partitions, and returning either pre-computed or re-computed results as the final query response.

Together, these inventive features define a system that efficiently manages and utilizes pre-computed results in distributed databases by tracking data changes via metadata signatures, enabling optimized query processing through selective result recomputation.
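The comparison and selective re-computation described in these features can be sketched as follows. All names here (`classify_sources`, `answer_query`, the `recompute` callback, and the dictionary-based stores) are hypothetical and chosen for illustration; this is one possible reading of the claimed flow, not the patented implementation.

```python
def classify_sources(stored, current):
    """Split source ids into (unchanged, stale) by comparing signatures.

    stored and current both map source id -> metadata signature.
    A source that is missing from stored counts as stale (it is new).
    """
    unchanged, stale = [], []
    for source_id, signature in current.items():
        if stored.get(source_id) == signature:
            unchanged.append(source_id)
        else:
            stale.append(source_id)
    return unchanged, stale


def answer_query(accessed, stored_sigs, current_sigs, cache, recompute):
    """Return per-source results, recomputing only the stale sources.

    cache maps source id -> pre-computed result and is assumed to be
    kept in step with stored_sigs. recompute is a caller-supplied
    function that rebuilds the result for a single source.
    """
    current = {s: current_sigs[s] for s in accessed}
    unchanged, stale = classify_sources(stored_sigs, current)
    results = {s: cache[s] for s in unchanged}
    for s in stale:
        cache[s] = recompute(s)        # refresh the pre-computed result
        stored_sigs[s] = current[s]    # record the new signature
        results[s] = cache[s]
    return results
```

A caller would pass the sources the query touches in `accessed`; sources outside that list are left alone, so their stored signatures and cached results stay as they were until some later query reads them.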

Stated Advantages

Enables optimized re-computation of pre-computed results by detecting changes through file system metadata, avoiding costly full re-computations.

Allows quick identification of stale pre-computed data and selective update of only changed partitions or input files.

Leverages automatically maintained file system metadata, requiring no additional database-level logic to track data changes.

Improves query processing efficiency in distributed databases by utilizing both pre-computed and re-computed results according to data freshness.

Documented Applications

Utilization in distributed database management systems where data is partitioned over multiple worker machines and pre-computed results accelerate query processing.

Application in database systems storing partitioned tables as sets of files with metadata for efficient tracking and update of pre-computed query results.

Use cases involving large datasets split by keys such as year, month, and day, where maintaining up-to-date pre-computed aggregates like total sales per store is valuable.
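For the total-sales-per-store case, a minimal sketch might keep one partial aggregate per daily partition and merge the partials at query time. It assumes the aggregate is a sum, which can be merged across partitions; the store names, dates, and amounts are made up for the example.

```python
from collections import Counter


def partition_totals(rows):
    """Pre-compute sales per store for one daily partition."""
    totals = Counter()
    for store, amount in rows:
        totals[store] += amount
    return totals


def total_sales(partials):
    """Merge the per-partition partials into overall sales per store."""
    merged = Counter()
    for partial in partials.values():
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)


# Hypothetical table partitioned by (year, month, day).
partitions = {
    (2016, 8, 4): [("store_a", 100), ("store_b", 50)],
    (2016, 8, 5): [("store_a", 25)],
}
partials = {key: partition_totals(rows) for key, rows in partitions.items()}

# New rows land in one daily partition, so only that partial is rebuilt.
partitions[(2016, 8, 5)].append(("store_b", 10))
partials[(2016, 8, 5)] = partition_totals(partitions[(2016, 8, 5)])

sales = total_sales(partials)
```

The partial for 4 August is reused untouched. Aggregates that cannot be merged this way, such as a distinct count, would need a different per-partition representation.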
