Data Lineage vs Data Provenance

Data Lineage and Data Provenance are not the same thing. Many data engineers and architects use the terms interchangeably, but they are two different concepts, each with its own meaning.

What is Data Provenance?

Data Provenance (or Data Provenance Document) captures the inputs, entities, systems and processes that influence the data of interest. In effect, this provides a historical record of the data and its origin, and generates evidence that supports forensic activities such as data dependency analysis, error or compromise detection and recovery, and auditing and compliance analysis.

What is Data Lineage?

Data Lineage is a simple type of “why provenance”. It can be represented visually to depict how data flows from its source to its destination through various changes: how the data gets transformed along the way, how its representation and parameters change, and how it splits or converges after each hop. A simple representation of data lineage is dots and lines: the dots are the data containers and the lines are the transformations between them.
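
The dots-and-lines picture maps naturally onto a directed graph. The following is a minimal Python sketch of that idea, not a reference implementation: the container names, the transformation names and the helper functions `upstream` and `origins` are all hypothetical and chosen only for illustration.

```python
# Each edge is (source container, transformation, destination container).
# Dots are the containers, lines are the transformations between them.
# All container and transformation names below are hypothetical.
edges = [
    ("orders_raw", "deduplicate", "orders_clean"),
    ("customers_raw", "normalize_names", "customers_clean"),
    ("orders_clean", "join_on_customer_id", "orders_enriched"),
    ("customers_clean", "join_on_customer_id", "orders_enriched"),
    ("orders_enriched", "aggregate_by_month", "monthly_revenue"),
]

# Index the edges by destination so the graph can be walked backwards.
incoming = {}
for src, transform, dst in edges:
    incoming.setdefault(dst, []).append((src, transform))


def upstream(container):
    """Return every container that the given container is derived from."""
    seen = set()
    stack = [container]
    while stack:
        node = stack.pop()
        for src, _transform in incoming.get(node, ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


def origins(container):
    """Return the upstream containers that have no inputs of their own."""
    return {c for c in upstream(container) if c not in incoming}
```

In this sketch, two edges pointing at `orders_enriched` model data converging after a hop, and `origins("monthly_revenue")` walks the lines back to the dots where the data began.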

Data Provenance vs Data Lineage

Data Provenance = Data Lineage + Extra

Data Lineage is the data-oriented part (metadata): the genealogy of the data, the history of its journey, where it began, how it came into being, how it changed over time, where it has been, which systems it has traveled through, and any loss or gain along the way.

Extra is the process-oriented part: the inputs, entities, systems and processes that influenced the data, which can be used to reproduce the data.
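
One way to picture this relationship is a provenance record that wraps a lineage trail and adds the process-oriented context around it. The sketch below is an assumption made for illustration only: the class names `LineageStep` and `ProvenanceRecord`, their fields and the sample values are hypothetical and do not come from any standard or library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LineageStep:
    """Data-oriented metadata: one hop in the data's journey."""
    source: str
    transformation: str
    destination: str


@dataclass
class ProvenanceRecord:
    """Lineage plus the process-oriented 'extra' (hypothetical layout)."""
    lineage: List[LineageStep]
    inputs: Dict[str, str] = field(default_factory=dict)  # e.g. parameters
    entities: List[str] = field(default_factory=list)     # people or teams
    systems: List[str] = field(default_factory=list)      # where it ran
    processes: List[str] = field(default_factory=list)    # what was run

    def has_process_context(self) -> bool:
        """True when the record carries more than lineage alone."""
        return bool(self.inputs or self.entities
                    or self.systems or self.processes)


# Lineage alone answers where the data came from and how it changed.
steps = [LineageStep("orders_raw", "deduplicate", "orders_clean")]
lineage_only = ProvenanceRecord(lineage=steps)

# Provenance adds what influenced the data, which helps reproduce it.
full = ProvenanceRecord(
    lineage=steps,
    inputs={"dedup_key": "order_id"},
    entities=["data-engineering team"],
    systems=["nightly batch cluster"],
    processes=["dedup_job v1"],
)
```

Here `lineage_only.has_process_context()` is false while `full.has_process_context()` is true, which mirrors the "lineage plus extra" relationship described above.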

Data Lineage