Design-time information based on run-time artifacts in transient cloud-based distributed computing clusters

Inventors

Arora, Sudhanshu • Donsky, Mark • Leng, Guang Yao • Koneru, Naren • She, Chang • Singh, Vikas • Vuppula, Himabindu

Assignees

Cloudera Inc

Member

Cloudera, Inc.

Cloudera, Inc. is a global leader in enterprise data cloud solutions, empowering organizations to transform complex data into actionable insights. With a mission to make what is impossible today possible tomorrow, Cloudera delivers a true hybrid data, analytics, and AI platform for any data, anywhere—from the Edge to AI. The company is driven by open-source innovation and is dedicated to advancing digital transformation for the world’s largest enterprises, providing secure, governed, and scalable solutions across hybrid and multi-cloud environments. Cloudera serves a diverse range of industries, including healthcare, financial services, manufacturing, telecommunications, public sector, and more, enabling customers to tackle transformational use cases and drive value through real-time insights. Cloudera is also a vocal advocate for open standards in machine learning operations and governance, and partners with leading organizations to deliver secure, data-driven innovation worldwide.

Publication Number

US-11663257-B2

Publication Date

2023-05-30

Expiration Date

2037-11-09

Abstract

Transient computing clusters can be temporarily provisioned in cloud-based infrastructure to run data processing tasks. Such tasks may be run by services operating in the clusters that consume and produce data including operational metadata. Techniques are introduced for tracking data lineage across multiple clusters, including transient computing clusters, based on the operational metadata. In some embodiments, operational metadata is extracted from the transient computing clusters and aggregated at a metadata system for analysis. Based on the analysis of the metadata, operations can be summarized at a cluster level even if the transient computing cluster no longer exists. Further relationships between workflows, such as dependencies or redundancies, can be identified and utilized to optimize the provisioning of computing clusters and tasks performed by the computing clusters.

Core Innovation

The invention introduces techniques for tracking data lineage across multiple distributed computing clusters, including transient computing clusters provisioned temporarily in cloud-based infrastructure, by leveraging operational metadata generated during data processing tasks. This metadata is extracted from clusters and aggregated at a metadata system for analysis, enabling the system to summarize operations at the cluster level even after the transient clusters no longer exist.
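The aggregation idea can be illustrated with a minimal sketch. It assumes a metadata store that lives outside any cluster; the names `MetadataRecord`, `MetadataStore`, `ingest`, and `summarize`, as well as the dataset paths, are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    # All field names here are illustrative assumptions.
    cluster_id: str
    job_id: str
    inputs: tuple
    outputs: tuple


class MetadataStore:
    """Holds records outside any cluster, so they outlive cluster teardown."""

    def __init__(self):
        self._by_cluster = defaultdict(list)

    def ingest(self, record):
        self._by_cluster[record.cluster_id].append(record)

    def summarize(self, cluster_id):
        # Cluster-level summary built only from stored metadata.
        records = self._by_cluster.get(cluster_id, [])
        return {
            "cluster_id": cluster_id,
            "job_count": len(records),
            "datasets_read": sorted({d for r in records for d in r.inputs}),
            "datasets_written": sorted({d for r in records for d in r.outputs}),
        }


store = MetadataStore()
store.ingest(MetadataRecord("transient-1", "job-a", ("s3://raw/events",), ("s3://stage/clean",)))
store.ingest(MetadataRecord("transient-1", "job-b", ("s3://stage/clean",), ("s3://mart/daily",)))

# The cluster "transient-1" could already be terminated here;
# the summary depends only on what was aggregated earlier.
summary = store.summarize("transient-1")
```

Because the summary reads only from the store, it does not need the cluster to be reachable at query time.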

The techniques also identify relationships between workflows, such as dependencies or redundancies, and use them to optimize both the provisioning of computing clusters and the execution of tasks within those clusters. The approach targets "schema on read" data processing systems such as Apache Hadoop, where the lack of a predefined schema and the transient nature of cloud-based clusters make data lineage difficult to track, leaving limited visibility into data sources, processing, and workflow structures.

Claims Coverage

The patent includes independent claims focused on methods and systems for inferring design-time information from run-time metadata of transient computing clusters and utilizing relationship data to adjust workflows.

Receiving and processing run-time metadata from transient computing clusters

Receiving metadata including run-time artifacts from a transient computing cluster provisioned temporarily to process data according to a workflow, and processing this metadata to generate design-time information indicative of the design of the cluster, the workflow, and the data processing jobs included therein.
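One way to picture this step is to collapse repeated job executions (run-time artifacts) into a single job definition per workflow (design-time information). The sketch below is an assumption-laden illustration: the artifact fields (`workflow`, `job`, `service`, `started_at`, `inputs`, `outputs`) and the function `infer_workflow_design` are hypothetical, not the claimed method.

```python
def infer_workflow_design(artifacts):
    """Collapse executions into one definition per job, in first-seen order."""
    designs = {}
    # Ordering by start time lets the job order approximate the workflow order.
    for art in sorted(artifacts, key=lambda a: a["started_at"]):
        workflow = designs.setdefault(art["workflow"], {"jobs": {}})
        job = workflow["jobs"].setdefault(
            art["job"],
            {"service": art["service"], "inputs": set(), "outputs": set(), "runs": 0},
        )
        job["inputs"].update(art["inputs"])
        job["outputs"].update(art["outputs"])
        job["runs"] += 1
    return designs


# Hypothetical run-time artifacts; "ingest" ran twice.
artifacts = [
    {"workflow": "nightly", "job": "aggregate", "service": "hive", "started_at": 20,
     "inputs": ["stage.clean"], "outputs": ["mart.daily"]},
    {"workflow": "nightly", "job": "ingest", "service": "spark", "started_at": 10,
     "inputs": ["raw.events"], "outputs": ["stage.clean"]},
    {"workflow": "nightly", "job": "ingest", "service": "spark", "started_at": 110,
     "inputs": ["raw.events"], "outputs": ["stage.clean"]},
]
design = infer_workflow_design(artifacts)
```

Here the two `ingest` executions are treated as runs of one job definition, which is a simplifying assumption; a real system would need a rule for deciding when two executions belong to the same design-time job.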

Generating relationship data to adjust workflows

Generating relationship data based on inferred design-time information, which identifies dependencies and/or redundancies within one or more workflows, and using this data as input to adjust workflow configurations.
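As a rough illustration of what relationship data might contain, the sketch below treats a dependency as "one job reads what another writes" and a redundancy as "two jobs apply the same operation to the same inputs". Both definitions, the job dictionary layout, and the function `find_relationships` are assumptions made for this example.

```python
from collections import defaultdict


def find_relationships(jobs):
    # Dependency: the consumer reads at least one dataset the producer writes.
    dependencies = sorted(
        (producer, consumer)
        for producer, p in jobs.items()
        for consumer, c in jobs.items()
        if producer != consumer and set(p["outputs"]) & set(c["inputs"])
    )
    # Redundancy: identical operation over identical inputs.
    by_signature = defaultdict(list)
    for name, job in jobs.items():
        by_signature[(job["operation"], frozenset(job["inputs"]))].append(name)
    redundancies = sorted(
        sorted(names) for names in by_signature.values() if len(names) > 1
    )
    return {"dependencies": dependencies, "redundancies": redundancies}


# Hypothetical jobs drawn from two workflows, wf1 and wf2.
jobs = {
    "wf1.clean": {"operation": "dedupe", "inputs": ["raw.events"], "outputs": ["stage.clean"]},
    "wf2.clean": {"operation": "dedupe", "inputs": ["raw.events"], "outputs": ["tmp.clean_copy"]},
    "wf1.report": {"operation": "aggregate", "inputs": ["stage.clean"], "outputs": ["mart.daily"]},
}
relationships = find_relationships(jobs)
```

In this toy input, `wf1.report` depends on `wf1.clean`, and the two `clean` jobs are flagged as redundant, which is the kind of signal that could feed a decision to merge or reschedule workflows.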

Extracting and publishing metadata from transient computing clusters via virtual machine instances

Implementing telemetry publishers in virtual machine instances of transient computing clusters to extract metadata and publish it to a queue from which a metadata system receives the data for processing.
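The publish-to-queue pattern can be sketched with Python's standard `queue.Queue` standing in for whatever message queue a deployment would use. The class `TelemetryPublisher`, the envelope fields, and the `drain` helper are hypothetical names chosen for this example.

```python
import json
import queue


class TelemetryPublisher:
    """Runs inside a (hypothetical) VM instance and publishes metadata to a queue."""

    def __init__(self, cluster_id, instance_id, out_queue):
        self._cluster_id = cluster_id
        self._instance_id = instance_id
        self._queue = out_queue

    def publish(self, artifact):
        # Tag each artifact with its origin so it stays attributable
        # after the instance and cluster are gone.
        envelope = {
            "cluster_id": self._cluster_id,
            "instance_id": self._instance_id,
            "artifact": artifact,
        }
        self._queue.put(json.dumps(envelope))


def drain(in_queue):
    """Metadata-system side: read every message currently on the queue."""
    received = []
    while True:
        try:
            received.append(json.loads(in_queue.get_nowait()))
        except queue.Empty:
            return received


shared_queue = queue.Queue()  # in-process stand-in for an external message queue
publisher = TelemetryPublisher("transient-1", "vm-0", shared_queue)
publisher.publish({"job": "ingest", "status": "succeeded"})
publisher.publish({"job": "aggregate", "status": "succeeded"})
messages = drain(shared_queue)
```

Decoupling the publisher from the metadata system through a queue means the cluster does not have to wait for the metadata system, and the metadata system does not need the cluster to still exist when it reads the messages.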

Managing cluster groups and configuring cluster provisioning based on relationship data

Designating cluster groups and configuring or reconfiguring provisioning of transient computing clusters to process data workflows based on generated relationship data.
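A simple way to think about cluster groups is to put transient clusters that ran the same set of workflows into one group, so that they can be monitored and provisioned as a unit. The grouping rule, the function `group_clusters`, and the sample identifiers below are assumptions for illustration only; the patent summary does not specify how groups are designated.

```python
from collections import defaultdict


def group_clusters(cluster_runs):
    """Group cluster ids by the set of workflows each cluster executed."""
    groups = defaultdict(list)
    for cluster_id, workflows in cluster_runs.items():
        # Clusters with the same workflow set share a key.
        groups[tuple(sorted(set(workflows)))].append(cluster_id)
    return {key: sorted(ids) for key, ids in groups.items()}


# Hypothetical history: three short-lived clusters, two of which ran the same workflow.
cluster_runs = {
    "transient-1": ["nightly"],
    "transient-2": ["nightly"],
    "transient-3": ["adhoc-report"],
}
cluster_groups = group_clusters(cluster_runs)
```

A group defined this way gives a stable handle for successive transient clusters that each exist only briefly.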

Identifying entities and relationships from extracted metadata to generate data lineage

Processing extracted metadata to identify entities involved in data processing, relationships among entities, and generating data lineage information indicative of the path of data through the entities in the workflow.
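The path of data through entities can be modeled as a directed graph in which datasets and operations are nodes. The following sketch, under that assumption, builds the graph from operation records and walks it to find everything upstream of a dataset; `build_lineage`, `upstream`, and the sample entity names are hypothetical.

```python
from collections import defaultdict


def build_lineage(operations):
    """Map each entity (dataset or operation) to its direct upstream entities."""
    parents = defaultdict(set)
    for op in operations:
        for src in op["inputs"]:
            parents[op["name"]].add(src)   # dataset -> operation
        for dst in op["outputs"]:
            parents[dst].add(op["name"])   # operation -> dataset
    return parents


def upstream(parents, entity):
    """Return every entity that the given entity transitively derives from."""
    seen = set()
    stack = [entity]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


operations = [
    {"name": "clean", "inputs": ["raw.events"], "outputs": ["stage.clean"]},
    {"name": "aggregate", "inputs": ["stage.clean"], "outputs": ["mart.daily"]},
]
lineage = build_lineage(operations)
ancestors = upstream(lineage, "mart.daily")
```

Downstream lineage follows the same pattern with the edge direction reversed.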

Visualizing data lineage with interactive graphical entity nodes

Presenting visualizations including graphical entity nodes linked based on identified relationships to provide intuitive views of data lineage, including cluster association and interactive details about entities.
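For a static approximation of such a view, lineage edges can be rendered as Graphviz DOT text, with entities boxed by the cluster they are associated with. This sketch only produces the DOT source; interactivity is out of scope, and `lineage_to_dot` and the sample data are hypothetical.

```python
def lineage_to_dot(edges, cluster_of):
    """Render lineage edges as DOT text, boxing nodes by associated cluster."""
    lines = ["digraph lineage {"]
    clusters = {}
    for node, cluster in cluster_of.items():
        clusters.setdefault(cluster, []).append(node)
    for cluster in sorted(clusters):
        # Graphviz draws subgraphs whose names start with "cluster" as boxes.
        lines.append(f'  subgraph "cluster_{cluster}" {{')
        lines.append(f'    label="{cluster}";')
        for node in sorted(clusters[cluster]):
            lines.append(f'    "{node}";')
        lines.append("  }")
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


edges = [("raw.events", "clean"), ("clean", "stage.clean")]
cluster_of = {"clean": "transient-1"}  # only the operation is tied to a cluster here
dot = lineage_to_dot(edges, cluster_of)
```

The resulting text can be passed to the Graphviz `dot` tool to produce an image.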

Taken together, the independent claims cover extracting and processing operational metadata from transient computing clusters to infer design-time information, generating relationship data and using it to adjust workflows, configuring cluster provisioning based on that data, and visualizing data lineage. The stated aim is better management and optimization of cloud-based distributed data processing systems.

Stated Advantages

Provides automatic collection, visualization, and utilization of upstream and downstream data lineage in distributed data processing systems that use a schema-on-read approach.

Enables tracking and summarizing of operations at the cluster level even after transient clusters are destroyed, preserving visibility into ephemeral cloud-based computing resources.

Facilitates identification of workflow dependencies and redundancies across multiple clusters, permitting optimization of cluster provisioning and workflow execution.

Allows recreation and versioning of workflows and jobs based on run-time metadata, supporting improved management of ad hoc and evolving data processing tasks.

Supports heterogeneous workflows involving multiple processing services, leveraging domain knowledge and cluster architecture information to optimize data processing across different environments.

Documented Applications

Tracking data lineage across multiple distributed computing clusters including transient cloud-based clusters.

Recreating jobs, workflows, and multiple versions of design-time elements based on inferred information from run-time artifacts.

Optimizing workflows and scheduling in distributed computing clusters by identifying dependencies and redundancies between workflow components.

Designating and grouping clusters based on usage patterns and relationships to facilitate monitoring and management.

Visualizing data lineage including relationships between entities, operations, and cluster association via interactive graphical representations.
