Background format optimization for enhanced queries in a distributed computing cluster
Inventors
Kornacker, Marcel • Erickson, Justin • Li, Nong • Kuff, Lenni • Robinson, Henry Noel • Choi, Alan • Behm, Alex
Assignees
Publication Number
US-11630830-B2
Publication Date
2023-04-18
Expiration Date
2033-10-01
Abstract
A format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine comprises a daemon that is installed on each data node in a Hadoop cluster. The daemon comprises a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter when the time comes. The converter converts data on the data node from its original format to a database-like format for use by the low latency (LL) query engine.
Core Innovation
The invention is a format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The engine includes a daemon installed on each data node in a Hadoop cluster. The daemon comprises a scheduler, which determines when to perform format conversion, and a converter, which converts data from its original format to a condensed format conducive to relational database processing, such as a columnar format like Parquet. The converted data is stored on the data node alongside the original data, so the LL query engine can access both.
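As a rough illustration of the daemon structure described above, the following Python sketch pairs a scheduler with a converter. The names NodeStore, Converter, Scheduler, is_due and tick are hypothetical, and the in-memory column dictionary only stands in for a real columnar format such as Parquet; none of this is taken from the patent's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Columns = Dict[str, List[object]]


@dataclass
class NodeStore:
    """Hypothetical stand-in for the data held on one data node."""
    original: List[Row]                  # data in its original (row) format
    converted: Optional[Columns] = None  # column-oriented copy, once available


class Converter:
    """Converts row records into a column-oriented layout (assumed format)."""

    def convert(self, rows: List[Row]) -> Columns:
        # Collect column names in first-seen order.
        names: List[str] = []
        for row in rows:
            for name in row:
                if name not in names:
                    names.append(name)
        # One list per column; missing values become None.
        return {name: [row.get(name) for row in rows] for name in names}


class Scheduler:
    """Decides when to convert and notifies the converter."""

    def __init__(self, is_due: Callable[[], bool], converter: Converter) -> None:
        self.is_due = is_due          # hypothetical "is it time?" predicate
        self.converter = converter

    def tick(self, store: NodeStore) -> bool:
        # Skip if already converted or if the scheduled time has not come.
        if store.converted is not None or not self.is_due():
            return False
        # The original data is left untouched; the copy sits alongside it.
        store.converted = self.converter.convert(store.original)
        return True
```

In this sketch the daemon would call tick periodically; the original rows remain available to queries whether or not the converted copy exists yet.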
The problem being solved arises from the limitations of Hadoop's existing schema-on-read model, which allows flexible data updates but requires complex and time-consuming parsing and transformation of data during query execution, leading to longer processing times. Existing relational database management systems use a schema-on-write model offering faster queries but lack Hadoop's flexibility. The invention addresses the need for a hybrid approach that preserves Hadoop's flexibility while enabling faster query processing by converting data in the background to more query-efficient formats at scheduled times.
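The trade-off described above can be illustrated with a small Python sketch. The sample data, the comma-separated layout and all function names are assumptions made for illustration only. The first query parses the raw text every time it runs (schema-on-read); the second runs against a copy that was parsed once, ahead of time.

```python
from typing import Dict, List

# Hypothetical raw records as they might be uploaded: "id,name,amount".
RAW_LINES = [
    "1,alice,12.5",
    "2,bob,7.0",
    "3,carol,30.25",
]


def total_schema_on_read(lines: List[str]) -> float:
    """Parse every line at query time, then aggregate."""
    total = 0.0
    for line in lines:
        _id, _name, amount = line.split(",")
        total += float(amount)  # parsing cost is paid on every query
    return total


def convert_once(lines: List[str]) -> Dict[str, list]:
    """Parse the raw lines a single time into typed columns."""
    ids, names, amounts = [], [], []
    for line in lines:
        raw_id, name, raw_amount = line.split(",")
        ids.append(int(raw_id))
        names.append(name)
        amounts.append(float(raw_amount))
    return {"id": ids, "name": names, "amount": amounts}


def total_converted(table: Dict[str, list]) -> float:
    """Aggregate directly over the already-typed column; no parsing."""
    return sum(table["amount"])
```

Both queries return the same total; the difference is that the second pays the parsing cost once, in the background, rather than on each query.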
Claims Coverage
The patent includes three independent claims covering a method, a system, and a machine-readable storage medium related to data processing for query execution in a distributed computing cluster.
Scheduled format conversion on data nodes
Data initially stored in an original format on individual data nodes is converted to a target format configured for relational database processing according to a predetermined schedule, and the converted data is stored on the same data node.
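A minimal sketch of this claim element, under several stated assumptions: the original data is CSV, the target format is a JSON file of columns (a stand-in for a real columnar format), the schedule is a set of hours of the day, and the ".columnar.json" suffix is an invented naming convention. The converted file is written next to the original in the same directory, mirroring storage on the same data node.

```python
import csv
import json
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Set

CONVERTED_SUFFIX = ".columnar.json"  # hypothetical naming convention


def convert_file(source: Path) -> Path:
    """Write a column-oriented copy of a CSV file next to the original."""
    with source.open(newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        columns: Dict[str, List[str]] = {name: [] for name in fieldnames}
        for row in reader:
            for name in fieldnames:
                columns[name].append(row[name])
    target = source.with_name(source.name + CONVERTED_SUFFIX)
    target.write_text(json.dumps(columns))
    return target


def run_scheduled_conversion(
    now: datetime, scheduled_hours: Set[int], data_dir: Path
) -> List[Path]:
    """Convert not-yet-converted CSV files, but only during scheduled hours."""
    if now.hour not in scheduled_hours:
        return []  # outside the predetermined schedule: do nothing
    written: List[Path] = []
    for source in sorted(data_dir.glob("*.csv")):
        target = source.with_name(source.name + CONVERTED_SUFFIX)
        if not target.exists():  # the original file is never modified
            written.append(convert_file(source))
    return written
```

Calling run_scheduled_conversion repeatedly is safe in this sketch: files that already have a converted sibling are skipped.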
Distributed query execution with format-aware processing
A query engine instance runs on each data node in a peer-to-peer network, processing queries at whichever data node receives them, aggregating data processed across nodes, and executing query fragments on the appropriately formatted data.
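The peer-to-peer arrangement described above might be sketched as follows. This is an in-process toy, not a networked engine: the DataNode class, its methods, and the "key,amount" record layout are all hypothetical. Whichever node receives the query acts as coordinator, each node runs its fragment on its local data (using the converted copy when one exists, otherwise the original), and the coordinator aggregates the partial results.

```python
from typing import Dict, List, Optional


class DataNode:
    """Hypothetical data node with a local query engine instance."""

    def __init__(
        self,
        name: str,
        original: List[str],
        converted: Optional[Dict[str, List[float]]] = None,
    ) -> None:
        self.name = name
        self.original = original      # raw "key,amount" lines
        self.converted = converted    # column-oriented copy, if converted
        self.peers: List["DataNode"] = []

    def run_fragment(self) -> float:
        """Compute the local partial sum of the amount field."""
        if self.converted is not None:
            # Converted data: read the typed column directly.
            return sum(self.converted["amount"])
        # Original data: parse each line at query time.
        return sum(float(line.split(",")[1]) for line in self.original)

    def handle_query(self) -> float:
        """Coordinate the query on the node that received it."""
        partials = [self.run_fragment()]
        partials += [peer.run_fragment() for peer in self.peers]
        return sum(partials)  # aggregate results from all nodes
```

Because every node can coordinate, sending the same query to any peer yields the same result in this sketch.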
Columnar target format for enhanced query processing
The target format used for conversion is a columnar format, conducive to relational database processing and optimized for low-latency SQL-like queries on the distributed data.
Taken together, the independent claims cover a method, a system, and a machine-readable storage medium for distributed data processing in a Hadoop cluster environment. Each node stores data in its original format and converts it to an optimized target format on a predetermined schedule, and a query engine running on each node executes queries against the converted data to achieve low latency.
Stated Advantages
The format conversion engine speeds up query processing by providing data in a condensed format conducive to relational database processing, thereby reducing complex parsing and transformation during queries.
It enables users to quickly update data while still working efficiently with stabilized data, combining advantages of schema-on-read and schema-on-write models.
Performing format conversion in the background at carefully selected times minimizes resource usage and interference with other operations on data nodes.
The invention allows flexible experimentation with varying data schemas without incurring overhead in data upload and update, while enabling efficient extraction of insights through faster query execution.
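The advantage of converting in the background at selected times can be illustrated with a simple selection rule. The rule itself is an assumption made for this sketch, not the patent's stated criterion: given a hypothetical map of average node load per hour, pick the least-loaded hours that fall under a load threshold.

```python
from typing import Dict, List


def pick_conversion_hours(
    load_by_hour: Dict[int, float], max_load: float, count: int
) -> List[int]:
    """Return up to `count` hours with the lowest load at or below max_load."""
    eligible = [hour for hour, load in load_by_hour.items() if load <= max_load]
    # Prefer the quietest hours; break ties by the earlier hour.
    eligible.sort(key=lambda hour: (load_by_hour[hour], hour))
    return sorted(eligible[:count])
```

The returned hours could then serve as the schedule that a conversion daemon checks before starting work, so that conversion avoids the busiest periods on the node.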
Documented Applications
The invention is described as deployed in Apache Hadoop clusters, using components such as HDFS and HBase data nodes. A query engine and a format conversion daemon are installed on each data node to support both batch-oriented and real-time ad hoc SQL-like queries.
It supports SQL applications and clients (e.g., JDBC, ODBC, Hue) that issue queries through the low latency query engine. With the format conversion engine in place, those queries execute directly on data stored in various formats in a unified Hadoop storage environment.