Design-time information based on run-time artifacts in a distributed computing cluster
Inventors
Singh, Vikas • Arora, Sudhanshu • Zeyliger, Philip • Vanzin, Marcelo Masiero • She, Chang
Assignees
Publication Number
US-11663033-B2
Publication Date
2023-05-30
Expiration Date
2037-11-09
Abstract
Techniques are disclosed for inferring design-time information based on run-time artifacts generated by services operating in a distributed computing cluster. In an embodiment, a metadata system extracts metadata including run-time artifacts generated by services in a distributed computing cluster while processing a workflow including multiple jobs. The extracted metadata is processed to identify entities and entity relationships which can then be used to generate lineage information. Using the lineage information, the metadata system can infer design-time information associated with the workflow. The inferred design-time information can then be utilized to, for example, recreate the workflow, recreate previous versions of the workflow, optimize the workflow, etc.
Core Innovation
The invention infers design-time information from run-time artifacts generated by services operating in a distributed computing cluster. A metadata system extracts metadata from services in the cluster as they process workflows comprising multiple jobs, then processes this metadata to identify entities and the relationships among them, from which it generates data lineage information. Using this lineage information, the system infers design-time information about the workflow, which can be used to recreate the workflow, recreate previous versions of it, or optimize it.
According to the background section, the problem addressed is the difficulty of managing and understanding data workflows in distributed data systems that take a "schema on read" approach (such as Apache Hadoop). Traditional data warehouses take a "schema on write" approach that requires significant upfront schema design, which becomes increasingly cumbersome at scale. "Schema on read" systems, by contrast, allow unstructured data to be loaded without a predefined schema, but leave users with little understanding of the data's structure, its lineage, and the operations applied to it. Such systems therefore need automatic collection, visualization, and use of upstream and downstream data lineage so that users can verify reliability and optimize processing.
The invention addresses these challenges by automatically extracting operational metadata generated during data processing jobs and workflows in the distributed computing cluster. This metadata includes run-time artifacts such as logs, temporary tables, and execution details. By analyzing this information, the system identifies entities involved and relationships among them, builds a graph model representing data lineage, and infers the corresponding design-time information such as schema, jobs, workflows, services, and resources involved. This enables users to better understand, manage, and optimize workflows without requiring prior knowledge or predefined schemas.
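As an illustration of the pipeline described above, the following minimal Python sketch builds data-flow lineage from run-time records and infers the job order of a workflow. The record layout (`job`, `service`, `reads`, `writes`), the sample paths, and the function names are hypothetical and are not taken from the patent. The ordering step uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+).

```python
from collections import defaultdict
from graphlib import TopologicalSorter

# Hypothetical run-time records: one entry per job execution, listing
# the datasets it read and wrote. Real artifacts (logs, temporary
# tables, execution details) would need parsing into this shape first.
RUN_RECORDS = [
    {"job": "ingest_logs", "service": "mapreduce",
     "reads": ["/raw/logs"], "writes": ["/staging/logs"]},
    {"job": "sessionize", "service": "spark",
     "reads": ["/staging/logs"], "writes": ["db.sessions"]},
    {"job": "daily_report", "service": "hive",
     "reads": ["db.sessions"], "writes": ["db.report"]},
]


def build_lineage(records):
    """Return data-flow edges (source, target) linking datasets and jobs."""
    edges = []
    for rec in records:
        for src in rec["reads"]:
            edges.append((src, rec["job"]))
        for dst in rec["writes"]:
            edges.append((rec["job"], dst))
    return edges


def infer_job_order(records):
    """Infer a job sequence: job B depends on job A if B reads what A writes."""
    producers = defaultdict(set)
    for rec in records:
        for dst in rec["writes"]:
            producers[dst].add(rec["job"])
    deps = {rec["job"]: set() for rec in records}
    for rec in records:
        for src in rec["reads"]:
            deps[rec["job"]] |= producers[src] - {rec["job"]}
    # static_order() yields each job after all of its dependencies.
    return list(TopologicalSorter(deps).static_order())
```

Even when the records arrive out of order, `infer_job_order` returns the jobs in dependency order, which is one simple form of recovering design-time structure from what was only observed at run time.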
Claims Coverage
The patent includes multiple independent claims covering methods, computer-readable media, and computer systems. The main inventive features involve processing run-time metadata to infer design-time information, generating lineage visualizations, and utilizing this information for various workflow management tasks.
Inferring design-time information from run-time metadata
Receiving metadata generated during workflow execution that includes run-time artifacts; determining data lineage based on the metadata; generating design-time information indicative of the design of the system, workflow, or jobs based on the data lineage; and generating and displaying a data lineage visualization.
Identifying entities and relationships from metadata for lineage determination
Identifying, based on the metadata, multiple entities involved in the execution of data processing jobs, such as files, directories, tables, scripts, query templates, query executions, job templates, and job executions; and determining relationships between these entities, such as data flow, parent-child, logical-physical, or control relationships.
Optimizing workflow structure based on inferred design-time information
Optimizing the workflow by changing the data processed, the sequencing or scheduling of jobs, or the services used to store and process the data to improve data processing and storage efficiency.
Versioning workflows by determining structure of previous workflow versions
Determining structures of previous versions of workflows based on the inferred design-time information derived from the run-time metadata.
Configuring the system to log and prepare metadata for extraction
Setting the computer system to generate the necessary logs and prepare the metadata in a format that facilitates extraction and processing by the metadata system.
Visualizing data lineage interactively
Generating data lineage visualizations that include graphical entity nodes representative of entities, linked based on their relationships and including interactive elements to display entity information upon user interaction.
Handling heterogeneous workflows
Supporting workflows composed of multiple different types of data processing jobs performed by different services within the computer system.
Determining workflow processing of sensitive data
Using design-time information to determine whether workflows involve processing personally identifiable information (PII).
The patent claims comprehensively cover systems and methods that extract run-time metadata to infer detailed design-time information and lineage, enabling visualization, versioning, workflow recreation, optimization, and sensitive data tracking within distributed computing clusters, especially those using schema-on-read architectures.
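To make the entity, relationship, and visualization features listed above more concrete, here is a hedged Python sketch of one possible data model and a renderer that emits Graphviz DOT text. The class names, the `kind` strings, and the `to_dot` helper are assumptions made for illustration; the patent summary does not specify a data model or an output format. The relationship types mirror the four named in the claims, and the DOT `tooltip` attribute stands in for the interactive display of entity information.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    DATA_FLOW = "data_flow"
    PARENT_CHILD = "parent_child"
    LOGICAL_PHYSICAL = "logical_physical"
    CONTROL = "control"


@dataclass(frozen=True)
class Entity:
    id: str
    kind: str  # hypothetical labels, e.g. "file", "table", "job_execution"


@dataclass(frozen=True)
class Relationship:
    source: str
    target: str
    type: RelationType


def to_dot(entities, relationships):
    """Render entities as nodes and relationships as labeled edges in DOT."""
    lines = ["digraph lineage {"]
    for e in entities:
        # The tooltip carries entity details shown on hover in SVG output.
        lines.append(f'  "{e.id}" [label="{e.id}", tooltip="{e.kind}"];')
    for r in relationships:
        lines.append(f'  "{r.source}" -> "{r.target}" [label="{r.type.value}"];')
    lines.append("}")
    return "\n".join(lines)


entities = [Entity("db.sessions", "table"), Entity("db.report", "table")]
relationships = [Relationship("db.sessions", "db.report", RelationType.DATA_FLOW)]
print(to_dot(entities, relationships))
```

The resulting text can be passed to any Graphviz-compatible renderer. A production visualization would likely need richer interaction than a tooltip, so treat this only as a starting point.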
Stated Advantages
Provides users with visibility into complex data workflows in schema-on-read systems, which lack an upfront schema and are therefore difficult to understand and manage.
Enables automatic collection, visualization, and utilization of upstream and downstream data lineage from run-time metadata without requiring prior knowledge or manual schema definitions.
Supports recreating workflows and previous workflow versions based on execution metadata, facilitating auditability and reproducibility.
Allows optimization of workflows by leveraging inferred design-time information and knowledge of the computing cluster architecture to improve data processing and storage efficiency.
Facilitates detection and tracking of sensitive data usage such as personally identifiable information within workflows.
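One way to picture the sensitive-data advantage is as a reachability check over lineage edges: anything downstream of a source tagged as PII is treated as potentially containing PII. The sketch below is an assumption about how such tracking could work, not the method the patent claims; the edge list, the tagged source, and both function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical data-flow edges (source, target) between datasets and jobs.
EDGES = [
    ("/raw/customers", "mask_names"),
    ("mask_names", "db.customers_masked"),
    ("/raw/clicks", "count_clicks"),
    ("count_clicks", "db.click_counts"),
]


def downstream(edges, start):
    """Return every node reachable from start by following data-flow edges."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def workflow_touches_pii(edges, pii_sources, workflow_jobs):
    """True if any job in the workflow is a PII source or downstream of one."""
    tainted = set(pii_sources)
    for src in pii_sources:
        tainted |= downstream(edges, src)
    return bool(tainted & set(workflow_jobs))
```

This check is deliberately conservative: a job that masks or drops the sensitive columns is still flagged, so a real system would need column-level lineage or explicit declassification rules to narrow the result.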
Documented Applications
Recreating workflows and multi-job sequences based on run-time metadata.
Recreating previous versions of workflows to view or execute historic workflow states.
Optimizing workflows by modifying data processing operations, job scheduling, or services used.
Visualizing data lineage through diagrams that graphically represent entities and their relationships, down to the column level in data sources.
Tracking and determining if workflows process sensitive data such as personally identifiable information (PII).
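The version-recreation application above can be sketched as grouping observed runs by their inferred structure: each time the set of jobs and their inputs and outputs changes, a new workflow version is recorded. This is a minimal illustration under assumed inputs. The run layout (a timestamp paired with job records), the ISO date strings, and the `workflow_versions` function are hypothetical and are not drawn from the patent.

```python
def workflow_versions(runs):
    """Return (first_seen, structure) for each change in workflow structure.

    runs is an iterable of (timestamp, job_records) pairs; each job record
    is a dict with "job", "reads", and "writes" keys.
    """
    versions = []
    for ts, records in sorted(runs, key=lambda run: run[0]):
        # A structure is the set of jobs together with their inputs and outputs.
        structure = frozenset(
            (rec["job"], tuple(sorted(rec["reads"])), tuple(sorted(rec["writes"])))
            for rec in records
        )
        if not versions or versions[-1][1] != structure:
            versions.append((ts, structure))
    return versions


ingest = {"job": "ingest", "reads": ["/raw/logs"], "writes": ["/staging/logs"]}
report = {"job": "report", "reads": ["/staging/logs"], "writes": ["db.report"]}
audit = {"job": "audit", "reads": ["db.report"], "writes": ["db.audit"]}

RUNS = [
    ("2023-01-03", [ingest, report, audit]),  # a third job appears
    ("2023-01-01", [ingest, report]),
    ("2023-01-02", [ingest, report]),
]
for first_seen, structure in workflow_versions(RUNS):
    print(first_seen, sorted(job for job, _, _ in structure))
```

Each stored structure holds enough information to rebuild the job graph for that version, which is what would allow a historic workflow state to be viewed or executed again.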