Design-time information based on run-time artifacts in transient cloud-based distributed computing clusters

Inventors

Arora, Sudhanshu; Donsky, Mark; Leng, Guang Yao; Koneru, Naren; She, Chang; Singh, Vikas; Vuppula, Himabindu

Assignees

Cloudera Inc

Publication Number

US-11663257-B2

Publication Date

2023-05-30

Expiration Date

2037-11-09

Abstract

Transient computing clusters can be temporarily provisioned in cloud-based infrastructure to run data processing tasks. Such tasks may be run by services operating in the clusters that consume and produce data including operational metadata. Techniques are introduced for tracking data lineage across multiple clusters, including transient computing clusters, based on the operational metadata. In some embodiments, operational metadata is extracted from the transient computing clusters and aggregated at a metadata system for analysis. Based on the analysis of the metadata, operations can be summarized at a cluster level even if the transient computing cluster no longer exists. Further relationships between workflows, such as dependencies or redundancies, can be identified and utilized to optimize the provisioning of computing clusters and tasks performed by the computing clusters.

Core Innovation

The invention introduces techniques for tracking data lineage across multiple distributed computing clusters, including transient computing clusters provisioned temporarily in cloud-based infrastructure, by leveraging operational metadata generated during data processing tasks. This metadata is extracted from clusters and aggregated at a metadata system for analysis, enabling the system to summarize operations at the cluster level even after the transient clusters no longer exist.

The techniques identify relationships between workflows, such as dependencies or redundancies, and use them to optimize both the provisioning of computing clusters and the execution of tasks within those clusters. The approach targets a problem in "schema on read" data processing systems such as Apache Hadoop: without a predefined schema, and with cloud-based clusters that exist only temporarily, data lineage is difficult to track, which limits visibility into data sources, processing steps, and workflow structure.

Claims Coverage

The patent includes independent claims focused on methods and systems for inferring design-time information from run-time metadata of transient computing clusters and utilizing relationship data to adjust workflows.

Receiving and processing run-time metadata from transient computing clusters

Receiving metadata including run-time artifacts from a transient computing cluster provisioned temporarily to process data according to a workflow, and processing this metadata to generate design-time information indicative of the design of the cluster, the workflow, and the data processing jobs included therein.
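
One way to picture this step is as a fold over observed job runs: repeated executions of the same job are collapsed into a single inferred definition. The sketch below is only an illustration under assumed names. The `JobExecution` schema, the `infer_design` function, and the sample data are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobExecution:
    """One run-time artifact: a single observed job run (hypothetical schema)."""
    cluster_id: str
    workflow: str
    job_name: str
    service: str
    inputs: tuple
    outputs: tuple


def infer_design(executions):
    """Collapse observed runs into one inferred definition per (workflow, job)."""
    design = {}
    for ex in executions:
        jobs = design.setdefault(ex.workflow, {})
        job = jobs.setdefault(
            ex.job_name,
            {"service": ex.service, "inputs": set(), "outputs": set(),
             "clusters": set(), "runs": 0},
        )
        # Accumulate everything seen at run time for this job.
        job["inputs"].update(ex.inputs)
        job["outputs"].update(ex.outputs)
        job["clusters"].add(ex.cluster_id)
        job["runs"] += 1
    return design


# Illustrative data only: two transient clusters ran the same workflow.
runs = [
    JobExecution("cluster-a", "nightly_etl", "clean", "spark", ("raw",), ("clean",)),
    JobExecution("cluster-b", "nightly_etl", "clean", "spark", ("raw",), ("clean",)),
    JobExecution("cluster-b", "nightly_etl", "report", "hive", ("clean",), ("report",)),
]
design = infer_design(runs)
print(sorted(design["nightly_etl"]))
```

Because the inferred definitions live outside the clusters, they remain available after the clusters that produced the runs have been torn down.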

Generating relationship data to adjust workflows

Generating relationship data based on inferred design-time information, which identifies dependencies and/or redundancies within one or more workflows, and using this data as input to adjust workflow configurations.
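
As a rough sketch of what such relationship data could look like, the example below treats a dependency as one job consuming a dataset that another job produces, and a redundancy as two jobs applying the same operation to the same inputs. Both definitions, the `find_relationships` function, and the job dictionary layout are simplifying assumptions made for illustration, not the claimed method.

```python
from itertools import combinations


def find_relationships(jobs):
    """jobs maps a job name to {"operation": str, "inputs": set, "outputs": set}."""
    # Dependency: the consumer reads at least one dataset the producer writes.
    dependencies = sorted(
        (producer, consumer)
        for producer, p in jobs.items()
        for consumer, c in jobs.items()
        if producer != consumer and p["outputs"] & c["inputs"]
    )
    # Redundancy: two jobs run the same operation over the same inputs.
    redundancies = sorted(
        (a, b)
        for a, b in combinations(sorted(jobs), 2)
        if jobs[a]["operation"] == jobs[b]["operation"]
        and jobs[a]["inputs"] == jobs[b]["inputs"]
    )
    return {"dependencies": dependencies, "redundancies": redundancies}


# Illustrative data only.
jobs = {
    "clean_a": {"operation": "dedupe", "inputs": {"raw"}, "outputs": {"clean_a"}},
    "clean_b": {"operation": "dedupe", "inputs": {"raw"}, "outputs": {"clean_b"}},
    "report": {"operation": "aggregate", "inputs": {"clean_a"}, "outputs": {"report"}},
}
relationships = find_relationships(jobs)
print(relationships)
```

A scheduler could use the dependency pairs to order jobs and the redundancy pairs to skip or merge duplicate work.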

Extracting and publishing metadata from transient computing clusters via virtual machine instances

Implementing telemetry publishers in virtual machine instances of transient computing clusters to extract metadata and publish it to a queue from which a metadata system receives the data for processing.
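
The publish-to-queue pattern described here can be sketched with Python's standard library. In this minimal example an in-process `queue.Queue` stands in for whatever durable message queue a real deployment would use. The `TelemetryPublisher` class, the envelope fields, and the `drain` helper are hypothetical names chosen for illustration.

```python
import json
import queue
import time


class TelemetryPublisher:
    """Runs inside a cluster instance and forwards artifacts to a queue."""

    def __init__(self, cluster_id, out_queue):
        self.cluster_id = cluster_id
        self.out_queue = out_queue

    def publish(self, artifact):
        # Tag each artifact with its source cluster so the metadata system
        # can still attribute it after the cluster has been destroyed.
        envelope = {
            "cluster_id": self.cluster_id,
            "published_at": time.time(),
            "artifact": artifact,
        }
        self.out_queue.put(json.dumps(envelope))


def drain(in_queue):
    """Metadata-system side: read every message currently in the queue."""
    messages = []
    while True:
        try:
            messages.append(json.loads(in_queue.get_nowait()))
        except queue.Empty:
            return messages


shared_queue = queue.Queue()
publisher = TelemetryPublisher("cluster-a", shared_queue)
publisher.publish({"job": "clean", "inputs": ["raw"], "outputs": ["clean"]})
received = drain(shared_queue)
print(received[0]["cluster_id"])
```

Decoupling the publisher from the consumer through a queue means the metadata system does not need to be reachable, or even running, at the moment a short-lived cluster emits its metadata.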

Managing cluster groups and configuring cluster provisioning based on relationship data

Designating cluster groups and configuring or reconfiguring provisioning of transient computing clusters to process data workflows based on generated relationship data.
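
The source does not say how cluster groups are chosen. As one plausible illustration, the sketch below groups transient clusters that ran the same set of jobs, on the assumption that such clusters are repeated instantiations of the same workload. The `group_clusters` function and the grouping criterion are assumptions, not the patent's definition.

```python
from collections import defaultdict


def group_clusters(cluster_jobs):
    """cluster_jobs maps a cluster id to the job names observed on it."""
    groups = defaultdict(list)
    for cluster_id, jobs in cluster_jobs.items():
        # Clusters with an identical job set share a group key.
        groups[frozenset(jobs)].append(cluster_id)
    return sorted(sorted(members) for members in groups.values())


# Illustrative data only.
observed = {
    "cluster-mon": ["clean", "report"],
    "cluster-tue": ["report", "clean"],
    "cluster-adhoc": ["explore"],
}
print(group_clusters(observed))
```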

Identifying entities and relationships from extracted metadata to generate data lineage

Processing extracted metadata to identify entities involved in data processing, relationships among entities, and generating data lineage information indicative of the path of data through the entities in the workflow.
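
Lineage of this kind is naturally modeled as a directed graph in which entities are nodes and relationships are edges. The sketch below builds upstream and downstream adjacency maps and walks them breadth-first. The `build_lineage` and `trace` functions and the entity names are hypothetical and shown only to make the idea concrete.

```python
from collections import defaultdict, deque


def build_lineage(relations):
    """relations is an iterable of (source_entity, target_entity) edges."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for source, target in relations:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def trace(entity, graph):
    """Breadth-first walk returning every entity reachable from `entity`."""
    visited = {entity}
    order = []
    pending = deque([entity])
    while pending:
        current = pending.popleft()
        for neighbor in sorted(graph.get(current, ())):
            if neighbor not in visited:
                visited.add(neighbor)
                order.append(neighbor)
                pending.append(neighbor)
    return order


# Illustrative data only: datasets and jobs alternate along the path.
relations = [
    ("raw.csv", "clean_job"),
    ("clean_job", "clean_table"),
    ("clean_table", "report_job"),
    ("report_job", "report"),
]
upstream, downstream = build_lineage(relations)
print(trace("report", upstream))
print(trace("raw.csv", downstream))
```

Walking the `upstream` map answers "where did this data come from?", while walking `downstream` answers "what is affected if this source changes?".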

Visualizing data lineage with interactive graphical entity nodes

Presenting visualizations including graphical entity nodes linked based on identified relationships to provide intuitive views of data lineage, including cluster association and interactive details about entities.

Taken together, the independent claims cover extracting operational metadata from transient computing clusters, processing it to infer design-time information, generating relationship data and using it to adjust workflows and cluster provisioning, and visualizing the resulting data lineage.

Stated Advantages

Provides automatic collection, visualization, and utilization of upstream and downstream data lineage in distributed data processing systems that use a schema-on-read approach.

Enables tracking and summarizing of operations at the cluster level even after transient clusters are destroyed, preserving visibility into ephemeral cloud-based computing resources.

Facilitates identification of workflow dependencies and redundancies across multiple clusters, permitting optimization of cluster provisioning and workflow execution.

Allows recreation and versioning of workflows and jobs based on run-time metadata, supporting improved management of ad hoc and evolving data processing tasks.

Supports heterogeneous workflows involving multiple processing services, leveraging domain knowledge and cluster architecture information to optimize data processing across different environments.

Documented Applications

Tracking data lineage across multiple distributed computing clusters including transient cloud-based clusters.

Recreating jobs, workflows, and multiple versions of design-time elements based on inferred information from run-time artifacts.

Optimizing workflows and scheduling in distributed computing clusters by identifying dependencies and redundancies between workflow components.

Designating and grouping clusters based on usage patterns and relationships to facilitate monitoring and management.

Visualizing data lineage including relationships between entities, operations, and cluster association via interactive graphical representations.
