Background format optimization for enhanced queries in a distributed computing cluster

Inventors

Kornacker, Marcel; Erickson, Justin; Li, Nong; Kuff, Lenni; Robinson, Henry Noel; Choi, Alan; Behm, Alex

Assignees

Cloudera Inc

Publication Number

US-11567956-B2

Publication Date

2023-01-31

Expiration Date

2033-10-01


Abstract

A format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine comprises a daemon that is installed on each data node in a Hadoop cluster. The daemon comprises a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter when the time comes. The converter converts data on the data node from its original format to a database-like format for use by the low latency (LL) query engine.

Core Innovation

The invention provides a format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine consists of a daemon installed on each data node in a Hadoop cluster, which includes a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter to execute the conversion of data from its original format to the target format.
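The division of labor between scheduler and converter can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the class names, the `is_good_time` policy hook, and the `tick` method are hypothetical, and the converter only records an output name instead of rewriting any data.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Converter:
    """Converts files from their original format to a target format."""
    # Assumed target; the source names Parquet only as one example.
    target_format: str = "parquet"
    converted: List[str] = field(default_factory=list)

    def convert(self, path: str) -> str:
        # Placeholder for the real conversion work: only records the output name.
        out = f"{path}.{self.target_format}"
        self.converted.append(out)
        return out


@dataclass
class Scheduler:
    """Decides when conversion should run and notifies the converter."""
    # Hypothetical policy hook, e.g. "the node is lightly loaded right now".
    is_good_time: Callable[[], bool]
    converter: Converter

    def tick(self, pending: List[str]) -> List[str]:
        # Called periodically by the daemon; does nothing until the policy allows it.
        if not self.is_good_time():
            return []
        return [self.converter.convert(p) for p in pending]
```

Keeping the timing decision in the scheduler and the data work in the converter means the conversion policy can change without touching the conversion code.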

This conversion to a database-like or condensed format (such as columnar Parquet) enables faster query processing by the low latency query engine, which can directly access the converted data without needing to perform expensive runtime transformations. The converted data coexists with the original data, providing flexibility for efficient querying while retaining the ability to update data quickly.
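To illustrate why a converted copy avoids runtime transformation, the following Python sketch parses raw CSV text once into typed, column-oriented lists. It is a simplified stand-in for a columnar format such as Parquet, not the format itself; the function name `to_columnar` and the schema representation are assumptions made for this example.

```python
import csv
import io
from typing import Callable, Dict, List

# Hypothetical schema representation: column name -> function that casts the raw text.
Schema = Dict[str, Callable[[str], object]]


def to_columnar(raw_csv: str, schema: Schema) -> Dict[str, List[object]]:
    """Parse row-oriented CSV text once and return typed, column-oriented data."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    columns: Dict[str, List[object]] = {name: [] for name in schema}
    for row in reader:
        for name, cast in schema.items():
            columns[name].append(cast(row[name]))
    return columns


# The original text is left untouched; the converted copy exists alongside it.
raw = "user,amount\nann,10\nbob,32\n"
cols = to_columnar(raw, {"user": str, "amount": int})

# A query over one column reads only that column and needs no parsing or casting.
total = sum(cols["amount"])
```

Because `raw` is never modified, queries can still fall back to the original data when no converted copy is available.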

The invention addresses a limitation of existing Hadoop systems, in which data is stored in its original format under a schema-on-read model, so every query must parse and transform the data at execution time. Traditional relational database management systems (RDBMS) instead use schema-on-write, which enforces a schema when the data is written. Schema-on-read is more flexible but results in slower query execution. By converting data in the background to an optimized format, the invention combines the benefits of both models: it accelerates SQL-like queries without impeding the flexibility of data updates.

Claims Coverage

The patent includes three independent claims covering a method, a computer system, and a non-transitory machine-readable storage medium for performing queries in a distributed computing cluster using converted data formats.

Creation of query fragments based on availability of converted data

A query engine at a first data node creates query fragments depending on whether converted data is available at that node, that is, data converted from its original format to a target format specified by a schema. The first data node interacts with the other data nodes in the cluster to form a peer-to-peer network.

Execution of query fragments on data matching converted or original format

The query engine causes each query fragment to be executed on data in the format (converted or original) for which that fragment was created, using the associated schema information.
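The two claim elements above, creating a fragment according to data availability and then executing it on data in the matching format, can be sketched in Python as follows. This is an assumption-laden illustration: `QueryFragment`, `create_fragment`, and `execute_fragment` are hypothetical names, the query is reduced to summing one integer column, and the original data is modeled as comma-separated lines with a header.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical stand-in for converted data: column name -> typed values.
Columns = Dict[str, List[int]]


@dataclass(frozen=True)
class QueryFragment:
    column: str
    data_format: str  # "converted" or "original"


def create_fragment(column: str, converted: Optional[Columns]) -> QueryFragment:
    # Plan against the converted copy only when this node actually holds one.
    fmt = "converted" if converted is not None else "original"
    return QueryFragment(column=column, data_format=fmt)


def execute_fragment(
    fragment: QueryFragment,
    original_lines: List[str],
    converted: Optional[Columns],
) -> int:
    """Sum one column, reading the data format the fragment was created for."""
    if fragment.data_format == "converted":
        if converted is None:
            raise ValueError("fragment expects converted data, but none is available")
        return sum(converted[fragment.column])
    # Original format: parse the header and every row at query time.
    header, *rows = original_lines
    idx = header.split(",").index(fragment.column)
    return sum(int(row.split(",")[idx]) for row in rows)
```

Both paths return the same answer; the converted path simply skips the per-query parsing that the original path has to repeat.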

Aggregation of intermediate results across nodes

The query engine obtains intermediate results from the execution of its query fragments and aggregates them with intermediate results from query engines on other nodes in the distributed cluster, producing the result for the client.
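As a rough illustration of this aggregation step, the Python sketch below merges per-node partial results into a single average. The choice of an average, the `(sum, count)` representation, and the function names are assumptions for the example; the source does not specify how intermediate results are encoded.

```python
from typing import Iterable, List, Tuple

# Hypothetical intermediate result produced by one node: (sum, row count).
Partial = Tuple[int, int]


def local_partial(values: List[int]) -> Partial:
    """Intermediate result computed by one node over its local data."""
    return (sum(values), len(values))


def merge_partials(partials: Iterable[Partial]) -> float:
    """Combine intermediate results from several nodes into a global average."""
    total = 0
    count = 0
    for part_sum, part_count in partials:
        total += part_sum
        count += part_count
    if count == 0:
        raise ValueError("no rows to aggregate")
    return total / count
```

Shipping `(sum, count)` pairs rather than per-node averages keeps the merge correct when nodes hold different numbers of rows.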

The independent claims collectively describe a distributed query processing method and system that dynamically utilizes data in converted optimized formats to perform low latency queries, efficiently coordinating execution and aggregation across a peer-to-peer network of data nodes.

Stated Advantages

Enables fast searches by preparing data in an easily queryable, condensed format.

Provides flexibility allowing rapid data updates and efficient operation on stabilized data.

Reduces query processing time by avoiding expensive runtime data transformations through background format conversion.

Minimizes resource interference by scheduling format conversion at carefully selected times.

Supports both schema-on-read flexibility and schema-on-write efficiency in Hadoop environments.

Documented Applications

Performing low latency, real-time ad hoc SQL-like queries on big data stored in Hadoop clusters.

Supporting a unified Hadoop platform that executes both batch-oriented MapReduce jobs and real-time distributed low latency queries.

Enabling flexible data analysis by allowing users to experiment with different data schemas without incurring large data upload overhead.
