Design-time information based on run-time artifacts in transient cloud-based distributed computing clusters
Inventors
Arora, Sudhanshu • Donsky, Mark • Leng, Guang Yao • Koneru, Naren • She, Chang • Singh, Vikas • Vuppula, Himabindu
Assignees
Publication Number
US-11086917-B2
Publication Date
2021-08-10
Expiration Date
2037-11-09
Interested in licensing this patent?
MTEC can help explore whether this patent might be available for licensing for your application.
Abstract
Transient computing clusters can be temporarily provisioned in cloud-based infrastructure to run data processing tasks. Such tasks may be run by services operating in the clusters that consume and produce data including operational metadata. Techniques are introduced for tracking data lineage across multiple clusters, including transient computing clusters, based on the operational metadata. In some embodiments, operational metadata is extracted from the transient computing clusters and aggregated at a metadata system for analysis. Based on the analysis of the metadata, operations can be summarized at a cluster level even if the transient computing cluster no longer exists. Further, relationships between workflows, such as dependencies or redundancies, can be identified and utilized to optimize the provisioning of computing clusters and tasks performed by the computing clusters.
Core Innovation
The invention introduces techniques for tracking data lineage across multiple distributed computing clusters, including transient cloud-based clusters, by extracting and analyzing operational metadata generated during data processing tasks. This operational metadata is aggregated at a metadata system, so operations can be summarized at a cluster level even after a transient cluster no longer exists. The aggregated metadata also makes it possible to identify relationships between workflows, such as dependencies or redundancies, which in turn can be used to optimize cluster provisioning and task scheduling.
Traditional data warehouse systems employing a "schema on write" approach require extensive upfront schema design, which becomes increasingly difficult and inflexible as data volumes and usage evolve. Conversely, bottom-up "schema on read" systems like Apache Hadoop allow ingestion of unstructured data without predefined schemas but lack visibility into data usage, processing workflows, and lineage. This opacity makes verifying system reliability and optimizing distributed database systems challenging, especially as data scale grows exponentially.
Claims Coverage
The patent includes three independent claims, directed to a method, a system, and a non-transitory computer-readable medium, for processing metadata from transient cloud clusters to identify workflow relationships and optimize processing.
Metadata processing from transient cloud-based computing clusters
Receiving metadata including run-time artifacts from multiple transient computing clusters temporarily provisioned in a cloud environment to process data workflows, and processing the metadata to generate design-time information indicative of cluster, workflow, or job designs.
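As a rough illustration of how design-time information might be inferred from run-time artifacts, the sketch below groups hypothetical job records by workflow and recovers the job order, external inputs, and final outputs. The record fields (`cluster`, `workflow`, `job`, `start`, `inputs`, `outputs`) and the function name `infer_design` are assumptions made for this example; the patent summary does not specify a metadata format.

```python
from collections import defaultdict

# Hypothetical run-time artifacts: one record per executed job.
# Field names are illustrative and are not taken from the patent.
run_time_artifacts = [
    {"cluster": "c1", "workflow": "etl", "job": "load", "start": 1,
     "inputs": ["raw"], "outputs": ["staged"]},
    {"cluster": "c1", "workflow": "etl", "job": "clean", "start": 2,
     "inputs": ["staged"], "outputs": ["clean"]},
    {"cluster": "c2", "workflow": "report", "job": "aggregate", "start": 5,
     "inputs": ["clean"], "outputs": ["summary"]},
]

def infer_design(artifacts):
    """Group job records by workflow and recover job order, inputs, and outputs."""
    by_workflow = defaultdict(list)
    for rec in artifacts:
        by_workflow[rec["workflow"]].append(rec)

    design = {}
    for name, recs in by_workflow.items():
        recs = sorted(recs, key=lambda r: r["start"])  # observed execution order
        produced = {o for r in recs for o in r["outputs"]}
        consumed = {i for r in recs for i in r["inputs"]}
        design[name] = {
            "clusters": sorted({r["cluster"] for r in recs}),
            "jobs": [r["job"] for r in recs],
            "inputs": sorted(consumed - produced),   # data read from outside the workflow
            "outputs": sorted(produced - consumed),  # data not consumed inside the workflow
        }
    return design

design = infer_design(run_time_artifacts)
```

Because the records are held outside the clusters, the inferred design remains available after the clusters that produced them have been torn down.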
Identification of dependencies or redundancies between workflows
Using the design-time information generated from multiple clusters' metadata to identify dependencies or redundancies between workflows running on different transient computing clusters.
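One simple way to picture this step, assuming each workflow's design-time information has already been reduced to its input datasets, output datasets, and ordered steps: a dependency exists when one workflow reads what another writes, and a redundancy exists when two workflows apply the same steps to the same inputs. The data layout and both matching rules are assumptions for illustration only, not the patent's claimed criteria.

```python
from itertools import combinations

# Hypothetical design-time summaries for workflows run on different clusters.
workflows = {
    "etl_a": {"cluster": "c1", "inputs": {"raw"}, "outputs": {"clean"},
              "steps": ("load", "clean")},
    "etl_b": {"cluster": "c2", "inputs": {"raw"}, "outputs": {"clean_copy"},
              "steps": ("load", "clean")},
    "report": {"cluster": "c3", "inputs": {"clean"}, "outputs": {"summary"},
               "steps": ("aggregate",)},
}

def find_relationships(workflows):
    """Return (dependencies, redundancies) as lists of workflow-name pairs."""
    dependencies, redundancies = [], []
    for a, b in combinations(sorted(workflows), 2):
        wa, wb = workflows[a], workflows[b]
        if wa["outputs"] & wb["inputs"]:
            dependencies.append((b, a))  # b reads data that a writes
        if wb["outputs"] & wa["inputs"]:
            dependencies.append((a, b))  # a reads data that b writes
        if wa["inputs"] == wb["inputs"] and wa["steps"] == wb["steps"]:
            redundancies.append((a, b))  # same steps applied to the same inputs
    return dependencies, redundancies

dependencies, redundancies = find_relationships(workflows)
```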
Workflow optimization based on identified relationships
Optimizing one or both workflows based on the identified dependency or redundancy between workflows, including configuring provisioning of transient clusters accordingly.
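A minimal sketch of one possible optimization, assuming dependencies and redundancies have already been identified: drop workflows that duplicate another, then order the rest so that upstream workflows (and the clusters that run them) are provisioned first. The inputs `dependencies`, `redundant_with`, and the function `plan` are hypothetical. The standard-library `graphlib.TopologicalSorter` (Python 3.9+) does the ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical inputs: (downstream, upstream) pairs and duplicate -> retained workflow.
all_workflows = ["etl_a", "etl_b", "report", "dashboard"]
dependencies = [("report", "etl_a"), ("dashboard", "report")]
redundant_with = {"etl_b": "etl_a"}

def plan(all_workflows, dependencies, redundant_with):
    """Return a provisioning order with redundant workflows removed."""
    kept = [w for w in all_workflows if w not in redundant_with]
    sorter = TopologicalSorter({w: set() for w in kept})
    for downstream, upstream in dependencies:
        # Redirect edges that mention a dropped duplicate to its retained twin.
        downstream = redundant_with.get(downstream, downstream)
        upstream = redundant_with.get(upstream, upstream)
        if downstream != upstream:
            sorter.add(downstream, upstream)
    return list(sorter.static_order())

order = plan(all_workflows, dependencies, redundant_with)
```

A real scheduler would also weigh cost, data locality, and cluster start-up time; this sketch only captures the ordering constraint.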
Extraction and publishing of metadata via entities in virtual machine instances
Entities (e.g., telemetry publishers) running in the virtual machine instances that make up the transient clusters extract metadata from cluster services and publish it to queues accessible by the metadata system.
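The publish-to-queue pattern described here can be sketched with an in-process queue standing in for whatever durable queue service a real deployment would use. The class `TelemetryPublisher`, the message fields, and the `drain` helper are all hypothetical names chosen for this example.

```python
import json
import queue

# Stand-in for a durable queue that outlives the cluster (an assumption;
# a real system would use an external queueing service).
metadata_queue = queue.Queue()

class TelemetryPublisher:
    """Runs inside a VM instance and forwards service metadata to a queue."""

    def __init__(self, cluster_id, instance_id, out_queue):
        self.cluster_id = cluster_id
        self.instance_id = instance_id
        self.out_queue = out_queue

    def publish(self, service, event):
        message = {
            "cluster": self.cluster_id,
            "instance": self.instance_id,
            "service": service,
            "event": event,
        }
        self.out_queue.put(json.dumps(message))

def drain(q):
    """Metadata-system side: read every message currently on the queue."""
    messages = []
    while True:
        try:
            messages.append(json.loads(q.get_nowait()))
        except queue.Empty:
            return messages

publisher = TelemetryPublisher("c1", "vm-0", metadata_queue)
publisher.publish("query-engine", {"job": "load", "status": "finished"})
publisher.publish("query-engine", {"job": "clean", "status": "finished"})
del publisher  # the cluster may be torn down; queued messages remain readable

received = drain(metadata_queue)
```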
Visualization of design-time information
Generating visualizations showing graphical entity nodes representing identified entities and their relationships derived from design-time information, including interactive elements and cluster association indications.
Designation of cluster groups based on workflow relationships
Designating cluster groups that include multiple transient computing clusters based on identified dependencies and/or redundancies between workflows processed in those clusters.
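One way to read this grouping step is as a connected-components problem: clusters whose workflows are linked by a dependency or redundancy end up in the same group. The sketch below uses a small union-find. The function name `group_clusters` and the input pairs are assumptions for illustration, and the patent may define groups differently.

```python
def group_clusters(clusters, related_pairs):
    """Group clusters that are linked, directly or transitively, by related workflows."""
    parent = {c: c for c in clusters}

    def find(c):
        # Follow parent links to the group representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in related_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for c in clusters:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical input: c1-c2 share a redundancy, c2-c3 share a dependency.
groups = group_clusters(["c1", "c2", "c3", "c4"], [("c1", "c2"), ("c2", "c3")])
```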
The claims collectively cover techniques and systems for extracting run-time metadata from transient cloud-based distributed clusters, inferring design-time workflow information, identifying inter-workflow dependencies and redundancies across clusters, and optimizing workflows and cluster provisioning accordingly, supported by metadata extraction, publishing, and visualization methods.
Stated Advantages
Provides users with visibility into data processing systems employing schema-on-read, allowing understanding of data sources, transformations, and impacts on downstream artifacts at a detailed level.
Enables automatic collection, visualization, and utilization of upstream and downstream data lineage in distributed and transient cloud-based computing clusters.
Supports recreation of workflows and jobs based on inferred design-time information, facilitating versioning, auditing, and management.
Allows tracking and management of sensitive data such as personally identifiable information by leveraging lineage information.
Facilitates optimization of workflows and provisioning decisions by identifying dependencies and redundancies across workflows performed in multiple transient computing clusters, improving resource usage and cost efficiency.
Documented Applications
Recreating workflows or jobs (including different versions) based on inferred design-time information derived from run-time metadata.
Tracking use of sensitive data such as personally identifiable information (PII).
Optimizing workflows by redesigning job sequences, data processing steps, or service selections.
Determining data lineage across multiple transient cloud-based computing clusters to identify dependencies and redundancies.
Guiding provisioning and scheduling of transient computing clusters based on inter-workflow relationships to improve efficiency.
Visualizing data lineage via lineage diagrams for enhanced user understanding and system management.
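The sensitive-data tracking application listed above can be pictured as a downstream traversal of a lineage graph: starting from a dataset known to contain PII, every dataset reachable through recorded transformations is flagged. The graph contents and the function name `downstream` are hypothetical, and the sketch assumes lineage edges have already been derived from the collected metadata.

```python
from collections import deque

# Hypothetical lineage graph: dataset -> datasets derived directly from it.
lineage = {
    "customers_raw": ["customers_clean"],
    "customers_clean": ["marketing_report", "billing_export"],
    "orders_raw": ["orders_clean"],
    "orders_clean": ["billing_export"],
}

def downstream(lineage, source):
    """Return every dataset derived, directly or indirectly, from `source`."""
    seen, todo = set(), deque([source])
    while todo:
        node = todo.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                todo.append(child)
    return sorted(seen)

# Datasets that may carry PII originating in customers_raw.
tainted = downstream(lineage, "customers_raw")
```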