Data lineage is the process of tracing the lifespan of data, from source to target. How/when was the data created? By whom? Where is it being processed? In which systems/reports does it reside? Who has access to it? This is all extremely important to know – especially when something goes wrong and the data looks fishy.
Traditionally data lineage runs horizontally (system to system), meaning it shows how data flows between different systems like between databases (DWH), ETL systems and reporting tools. Automated data lineage tools enable data teams to very easily and quickly reverse engineer reports for when they need to locate the source of an error in a report or conduct impact analysis ahead of making a change to a certain process or report.
Data lineage in the ETL is a bit different, and enables even more visibility into the data flow. Vertical lineage, or column-to-column lineage, enables BI & Analytics professionals to drill down into each ETL process, stored procedure and report to see how each was created. It also reveals cross-relational impact analysis which is critical ahead of making changes to an ETL process. Manually conducting impact analysis is extremely tedious and time consuming, but automated vertical data lineage eliminates this futile effort by quickly and clearly answering questions such as:
- Which other ETL processes will be impacted by the change I plan to make?
- What is the source of this particular attribute in my report?
- How was this KPI calculated?
- Why do these two “identical” fields have different values?