Data lineage

Origins and events of data / From Wikipedia, the free encyclopedia

Data lineage describes where data originates, what happens to it, and where it moves over time.[1] It provides visibility into a data analytics process and makes it much easier to trace errors back to their root cause.[2]

Data lineage also enables replaying specific portions or inputs of the data flow for step-wise debugging or for regenerating lost output. Database systems use such information, called data provenance, to address similar validation and debugging challenges.[3] Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins. The generated evidence supports forensic activities such as data-dependency analysis, error/compromise detection and recovery, auditing, and compliance analysis. "Lineage is a simple type of why provenance."[3]

Data lineage can be represented visually to show how data flows from its source to its destination through the various changes and hops it undergoes in the enterprise environment: how the data is transformed along the way, how its representation and parameters change, and how it splits or converges after each hop. A simple representation of data lineage uses dots and lines, where each dot represents a data container for data points and each connecting line represents the transformation that a data point undergoes between two data containers.
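The dots-and-lines picture maps naturally onto a directed graph. The following is a minimal sketch of that idea, not a reference implementation; the class name `LineageGraph`, its methods, and the container names such as `crm.customers` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Dots are data containers; lines are transformations between them."""

    # Maps a (source, target) hop to the name of the transformation on that hop.
    edges: dict[tuple[str, str], str] = field(default_factory=dict)

    def add_hop(self, source: str, target: str, transformation: str) -> None:
        self.edges[(source, target)] = transformation

    def containers(self) -> set[str]:
        """Every data container (dot) that appears on at least one hop."""
        return {name for hop in self.edges for name in hop}

    def downstream(self, container: str) -> list[str]:
        """Containers fed directly by `container`; more than one means a split."""
        return sorted(t for (s, t) in self.edges if s == container)

    def upstream(self, container: str) -> list[str]:
        """Containers feeding `container` directly; more than one means a convergence."""
        return sorted(s for (s, t) in self.edges if t == container)


# Hypothetical example data: two sources converge into one report.
graph = LineageGraph()
graph.add_hop("crm.customers", "staging.customers", "extract")
graph.add_hop("staging.customers", "warehouse.dim_customer", "deduplicate")
graph.add_hop("web.orders", "warehouse.fact_orders", "load")
graph.add_hop("warehouse.dim_customer", "reports.revenue", "join")
graph.add_hop("warehouse.fact_orders", "reports.revenue", "join")

print(graph.upstream("reports.revenue"))
```

Storing the transformation name on the edge keeps the "line" meaningful: a viewer can label each hop with what happened to the data there.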

Representation broadly depends on the scope of the metadata management and on the reference point of interest. Backward data lineage shows the sources of the data and the intermediate hops leading up to the reference point, while forward data lineage shows the intermediate flows leading from the reference point to the final destination data points. These views can be combined into end-to-end lineage for a reference point, which provides a complete audit trail of that data point from its sources to its final destinations. As the number of data points or hops increases, such a representation becomes too complex to read. A particularly useful feature of a data lineage view is therefore the ability to simplify it by temporarily masking unwanted peripheral data points. Tools with a masking feature allow the view to scale and make analysis easier for both technical and business users. Data lineage also enables companies to trace the sources of specific business data in order to track errors, implement process changes, and carry out system migrations, saving significant time and resources and improving BI efficiency.[4]
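Backward, forward, and end-to-end lineage can be read as graph traversals from a reference point. The sketch below illustrates this under simplifying assumptions: lineage is held as a plain list of (source, target) pairs, the container names are hypothetical, and masking is modelled in the simplest possible way, by dropping every hop that touches a hidden data point.

```python
from collections import deque

# Hypothetical lineage hops, each a (source, target) pair.
EDGES = [
    ("crm.customers", "staging.customers"),
    ("staging.customers", "warehouse.dim_customer"),
    ("web.orders", "warehouse.fact_orders"),
    ("warehouse.dim_customer", "reports.revenue"),
    ("warehouse.fact_orders", "reports.revenue"),
    ("warehouse.dim_customer", "reports.churn"),
]


def _reachable(start, edges, reverse=False):
    """Breadth-first search over the hops, optionally against their direction."""
    neighbours = {}
    for source, target in edges:
        if reverse:
            source, target = target, source
        neighbours.setdefault(source, set()).add(target)
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # only relevant if the lineage contains a cycle
    return seen


def backward_lineage(point, edges):
    """All sources and intermediate hops that feed the reference point."""
    return _reachable(point, edges, reverse=True)


def forward_lineage(point, edges):
    """All intermediate hops and destinations fed by the reference point."""
    return _reachable(point, edges)


def end_to_end_lineage(point, edges):
    """Backward and forward lineage combined, including the point itself."""
    return backward_lineage(point, edges) | {point} | forward_lineage(point, edges)


def mask(edges, hidden):
    """Return a simplified view without any hop that touches a hidden point."""
    return [(s, t) for s, t in edges if s not in hidden and t not in hidden]


reference = "warehouse.dim_customer"
print(sorted(backward_lineage(reference, EDGES)))
print(sorted(forward_lineage(reference, EDGES)))
print(len(mask(EDGES, {"reports.churn"})))
```

Because `mask` returns a new list and leaves `EDGES` untouched, the masking is temporary in the sense the text describes: the full lineage is still available when the user removes the mask.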

The scope of the data lineage determines the volume of metadata required to represent it. Usually, data governance and data management functions determine the scope based on applicable regulations, the enterprise data management strategy, data impact, reporting attributes, and the critical data elements of the organization.

Data lineage provides the audit trail of the data points at the finest level of granularity, but the lineage may be presented at various zoom levels to simplify the large volume of information, much like analytic web maps. At a very high level, data lineage shows which systems the data interacts with before it reaches its destination. As the granularity increases, the view goes down to the data point level, where it can provide the details of a data point and its historical behavior, its attribute properties, and the trends and data quality of the data that has passed through that specific data point.
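One way to picture zoom levels is to roll fine-grained lineage up to coarser names. The sketch below assumes, purely for illustration, that every data point is named `system.table.column`; the naming scheme, the sample hops, and the function `zoom_out` are hypothetical.

```python
# Hypothetical column-level lineage, named as system.table.column.
COLUMN_EDGES = [
    ("crm.customers.email", "warehouse.dim_customer.email"),
    ("crm.customers.name", "warehouse.dim_customer.full_name"),
    ("warehouse.dim_customer.email", "reports.revenue.customer_email"),
    ("web.orders.amount", "warehouse.fact_orders.amount"),
    ("warehouse.fact_orders.amount", "reports.revenue.total"),
]


def zoom_out(edges, level):
    """Roll lineage up to a coarser view.

    level=1 keeps only the system, level=2 the system and table,
    level=3 the full column name (no roll-up).
    """
    def truncate(name):
        return ".".join(name.split(".")[:level])

    rolled = {(truncate(s), truncate(t)) for s, t in edges}
    # Hops that collapse onto a single node carry no information at this zoom level.
    return sorted((s, t) for s, t in rolled if s != t)


for hop in zoom_out(COLUMN_EDGES, level=1):
    print(hop)
```

At `level=1` the five column-level hops collapse into three system-level hops, while `level=2` yields four table-level hops from the same underlying metadata.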

Data governance plays a key role in metadata management by providing guidelines, strategies, policies, and implementation. Data quality and master data management help enrich the data lineage with more business value. Although the final representation of data lineage is provided in one interface, the way the metadata is harvested and exposed to the data lineage graphical user interface can differ entirely. Data lineage can therefore be broadly divided into three categories based on how the metadata is harvested: data lineage involving software packages for structured data, programming languages, and big data.

Data lineage information includes technical metadata about data transformations. Enriched data lineage information may include data quality test results, reference data values, data models, business vocabulary, data stewards, program management information, and enterprise information systems linked to the data points and transformations. A masking feature in the data lineage visualization allows tools to incorporate all of these enrichments while showing only those that matter for a specific use case. To represent disparate systems in one common view, "metadata normalization" or standardization may be necessary.
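Metadata normalization can be illustrated as mapping records harvested in different shapes onto one common record type. The sketch below is an assumption-laden example: the two input formats (an ETL tool emitting `src`/`dst`/`op` and a SQL parser emitting `input_table`/`output_table`/`statement`) are invented for illustration and do not correspond to any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """Common, normalized representation of one lineage hop."""

    source: str
    target: str
    transformation: str
    harvested_from: str


def from_etl_tool(record: dict) -> LineageEdge:
    # Hypothetical ETL-tool export with keys "src", "dst", "op".
    return LineageEdge(
        source=record["src"].lower(),
        target=record["dst"].lower(),
        transformation=record["op"],
        harvested_from="etl_tool",
    )


def from_sql_parser(record: dict) -> LineageEdge:
    # Hypothetical SQL-parser output with keys "input_table", "output_table", "statement".
    return LineageEdge(
        source=record["input_table"].lower(),
        target=record["output_table"].lower(),
        transformation=record["statement"],
        harvested_from="sql_parser",
    )


etl_records = [{"src": "CRM.Customers", "dst": "Staging.Customers", "op": "extract"}]
sql_records = [
    {
        "input_table": "staging.customers",
        "output_table": "warehouse.dim_customer",
        "statement": "deduplicate",
    }
]

# One common view over both harvesting methods.
common_view = [from_etl_tool(r) for r in etl_records] + [
    from_sql_parser(r) for r in sql_records
]

for edge in common_view:
    print(edge.source, "->", edge.target, f"({edge.transformation})")
```

Lower-casing the container names is one small example of standardization: without it, `Staging.Customers` from the first system and `staging.customers` from the second would appear as two unrelated dots in the combined view.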