Interactive identification of similar SQL queries

Inventors

Agrawal, RituparnaSingh, AnupamPandian, Prithviraj

Assignees

Cloudera Inc

Publication Number

US-11645294-B2

Publication Date

2023-05-09

Expiration Date

2037-04-24

Interested in licensing this patent?

MTEC can help explore whether this patent might be available for licensing for your application.


Abstract

Systems and methods for very fast grouping of “similar” SQL queries according to user-supplied similarity criteria. The user-supplied similarity criteria include a threshold quantifying the degree of similarity between SQL queries and common artifacts included in the queries. A similarity-characterizing data structure allows for the very fast grouping of “similar” SQL queries. Because the computation is distributed among multiple compute nodes, a small cluster of compute nodes takes a short time to compute the similarity-characterizing data on a workload of tens of millions of queries. The user can supply the similarity criteria through a UI or a command line tool. Furthermore, the user can adjust the degree of similarity by supplying new similarity criteria. Accordingly, the system can display in real time or near real time, updated SQL groupings corresponding to the newly supplied similarity criteria using the originally computed similarity-characterizing data structure.

Core Innovation

The invention provides systems and methods for very fast grouping of similar SQL queries according to user-supplied similarity criteria, which include thresholds quantifying the degree of similarity between SQL queries and common artifacts included in the queries. A similarity-characterizing data structure enables very fast grouping of similar SQL queries. Computation is distributed among multiple compute nodes, allowing a small cluster to compute similarity-characterizing data on workloads of tens of millions of queries in a short time.

Users supply similarity criteria through a UI or command line tool and can adjust the degree of similarity by supplying new criteria. This enables the system to display in real time or near real time updated SQL groupings corresponding to the new similarity criteria using the originally computed similarity-characterizing data structure.

The background problem addressed is the difficulty of analyzing massive logs of SQL queries, which can number in the tens of millions. Database architects face challenges in answering questions such as how many SQL queries are exact duplicates, whether all queries are unique, or how many queries are similar though not exact duplicates. Current techniques are inefficient at handling pairwise comparisons at scale, making interactive and fast identification of similar queries impractical.

Claims Coverage

The patent includes multiple independent claims centered on methods, systems, and computer-readable media for identifying similar database queries using a special data structure and user-supplied criteria.

Using a similarity-characterizing data structure for grouping queries

Accessing a data structure comprising symmetric binary matrices representing Boolean similarity measures between pairs of database queries; retrieving matrices associated with user-specified artifacts and similarity thresholds; applying logic operations across these matrices to produce resultant matrices identifying groupings of similar queries; and retrieving these groupings based on the resultant matrices.

Interactive updating of query groupings

Receiving updated input sets with different similarity criteria from a user and, in real time or near real time, replacing previous groupings of database queries with new groupings that match the updated criteria using the symmetric binary matrix data structure.

Partitioning database queries for efficiency

Mapping database queries into partitions where each partition includes queries sharing at least one artifact and associating each partition with a data structure. Upon receiving user input, identifying relevant partitions based on database logic to retrieve and apply similarity characterizing data structures within those partitions.

Generating the similarity characterizing data structure using LSH

Creating signature vectors for artifacts within database queries; applying a locality sensitive hashing (LSH) methodology to assign signature vectors into hash buckets corresponding to similarity thresholds; upon detecting colliding queries in buckets, setting Boolean similarity measures in the data structure accordingly.

Distributed generation of similarity data structure

Dividing database queries among multiple compute nodes, where each node identifies artifacts and thresholds, determines similarity Boolean measures for query pairs per artifact-threshold combination, and aggregates this information into the similarity-characterizing data structure.

The claims cover a comprehensive approach involving construction and use of binary symmetric matrix data structures, interactive querying and updating by users, efficiency through partitioning and distributed computation, and application of LSH for scalable similarity detection of SQL queries.

Stated Advantages

Enables very fast grouping of similar SQL queries from tens of millions of queries using distributed computation.

Supports interactive user input to adjust similarity criteria and receive updated groupings in real time or near real time.

Reduces computational complexity by employing bitwise summaries, MinHash, and Locally Sensitive Hashing techniques.

Scales efficiently regardless of the number of queries, providing performance that is not constrained by query volume.

Facilitates easier analysis and optimization of database queries by grouping similar queries, usable in lineage diagrams and database optimization.

Documented Applications

Analyzing large SQL query logs for identifying similar queries to understand data access patterns.

Supporting database architects in query optimization such as index and materialized view creation.

Generating lineage diagrams representing groups of similar queries and their associated transformations for easier visualization and management.

Allowing interactive exploration by database architects of query groupings using different similarity criteria via UI or command line tools.

JOIN OUR MAILING LIST

Stay Connected with MTEC

Keep up with active and upcoming solicitations, MTEC news and other valuable information.