At Octopai, we talk a lot about data lineage. We also talk a lot about how the ability to track it, visualize it, and query it takes much of the drudgery out of the BI (business intelligence) professional’s life. Understanding data lineage can lead to better business decisions.
The term “data lineage” may not mean much to the average person. Even some BI professionals are unfamiliar with the concept. So let’s break it down a bit, because it’s important, whether you’re a producer, consumer, or curator of business data.
Data Lineage in a Nutshell
Data lineage is the journey data takes from its originating source, through every system and transformation along the way, to the BI and analytic products ultimately created from it.
Why is this important?
In many businesses—especially larger ones that grow by merger and acquisition—the notion of a “single source of truth” is a quaint pipe dream. The information on the executive’s dashboard typically comes from multiple sources and undergoes multiple transformations along the way. When there’s something fishy about the numbers, it’s important to know where those numbers came from, so the problem can be identified and fixed.
Data lineage comes in two flavors. We call them horizontal lineage and vertical lineage.
Horizontal Data Lineage
Horizontal data lineage is what most people think of when they think of data lineage. It’s the horizontal representation of the system-to-system journey of the data used for BI and analytics that documents:
– Where the data came from
– What transformations the data went through as it traveled from its source to its destination
– What stops the data made along the way (MRR, Stage, DWH)
– Which reports, visualizations, or analyses use the data
When a report or dashboard contains information from multiple data sources, it’s easy to lose track of where a given piece of information comes from. BI developers (the wiser ones, anyway) have traditionally tracked this metadata for each deliverable so they can more easily locate the sources of problems, bugs, and data errors. But doing this manually is neither accurate nor scalable, and sometimes it is simply impossible to know where to look for the whole truth. Moreover, when a BI developer who is the only holder of this critical knowledge leaves the organization, the BI group suffers a major loss.
Without a common store of horizontal data lineage, BI developers must repeatedly recreate or re-engineer their own lineage information before they can be comfortable using the data it describes. Going through this exercise for every report, analysis, and dashboard is not a productive use of a BI professional’s time.
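To make the idea of a common store concrete, here is a minimal sketch in Python. It is an illustration only, not how Octopai or any other product works internally: the LineageStore class, its method names, and every system name in the sample data (crm_db, mrr.customers, sales_dashboard, and so on) are hypothetical. It records each system-to-system hop once, then answers the questions listed above: where the data came from, which transformations and stops it passed through, and which deliverables use it.

```python
from collections import defaultdict


class LineageStore:
    """Toy in-memory store of system-to-system (horizontal) lineage hops."""

    def __init__(self):
        self._upstream = defaultdict(list)    # target -> [(source, transformation)]
        self._downstream = defaultdict(list)  # source -> [target]

    def add_hop(self, source, target, transformation):
        """Record that `target` is fed by `source` via `transformation`."""
        self._upstream[target].append((source, transformation))
        self._downstream[source].append(target)

    def trace_back(self, node):
        """Return every (source, target, transformation) hop that feeds `node`."""
        hops, seen, stack = [], {node}, [node]
        while stack:
            current = stack.pop()
            for source, transformation in self._upstream.get(current, []):
                hops.append((source, current, transformation))
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return hops

    def origins(self, node):
        """Return the originating systems: upstream nodes with no upstream of their own."""
        sources = {source for source, _, _ in self.trace_back(node)}
        return {s for s in sources if not self._upstream.get(s)}

    def impacted(self, node):
        """Return everything downstream of `node`, i.e. what a change there could affect."""
        seen, stack = set(), [node]
        while stack:
            for target in self._downstream.get(stack.pop(), []):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen


# Hypothetical sample data: two source systems feeding one dashboard.
store = LineageStore()
store.add_hop("crm_db", "mrr.customers", "full copy")
store.add_hop("mrr.customers", "stage.customers", "deduplicate")
store.add_hop("stage.customers", "dwh.dim_customer", "conform keys")
store.add_hop("erp_db", "mrr.orders", "full copy")
store.add_hop("mrr.orders", "stage.orders", "currency conversion")
store.add_hop("stage.orders", "dwh.fact_sales", "aggregate by day")
store.add_hop("dwh.dim_customer", "sales_dashboard", "join")
store.add_hop("dwh.fact_sales", "sales_dashboard", "join")

for source, target, transformation in store.trace_back("sales_dashboard"):
    print(f"{source} -> {target} ({transformation})")
```

With the hops recorded once in a shared place, a question like "where do the numbers on the sales dashboard come from?" becomes a single call to store.origins("sales_dashboard") rather than a fresh investigation for every deliverable.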
Vertical Data Lineage – Really?
Yes, yes, your eyes have not deceived you – we are indeed discussing Vertical Lineage. Very different from horizontal lineage, vertical lineage is the column-to-column lineage that enables the BI professional to drill down into each ETL process, stored procedure and report to see how each was created.
Vertical lineage also supports impact analysis across related objects. Without a common store of vertical data lineage, users of BI and analytic tools cannot determine whether a given report or dataset is suitable for their purposes, and hunting for that information by hand is a major waste of time. An automated vertical data lineage solution eliminates this futile effort by quickly and clearly answering questions such as:
– What is the source of an attribute in my report?
– How was this KPI calculated?
– Why do these two “identical” fields have different values?
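The questions above can be sketched with a small column-level lineage map in Python. Again, this is a simplified illustration under stated assumptions: the ColumnDef structure, the helper functions, and all table, column, and report names are hypothetical, and a real tool would extract these mappings from ETL processes, stored procedures, and report definitions rather than from a hand-written dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnDef:
    """How one column is derived: an expression plus the columns it reads."""
    expression: str
    inputs: tuple  # fully qualified input column names


# Hypothetical column-to-column lineage for two reports.
COLUMN_LINEAGE = {
    "sales_report.revenue": ColumnDef(
        "SUM(net_amount)", ("dwh.fact_sales.net_amount",)
    ),
    "dwh.fact_sales.net_amount": ColumnDef(
        "gross_amount - discount",
        ("stage.orders.gross_amount", "stage.orders.discount"),
    ),
    "finance_report.revenue": ColumnDef(
        "SUM(gross_amount)", ("stage.orders.gross_amount",)
    ),
}


def source_columns(column, lineage=COLUMN_LINEAGE):
    """Return the columns at the start of the chain that `column` depends on."""
    sources, seen, stack = set(), {column}, [column]
    while stack:
        current = stack.pop()
        definition = lineage.get(current)
        if definition is None:
            # No recorded derivation: treat this column as an original source.
            sources.add(current)
            continue
        for name in definition.inputs:
            if name not in seen:
                seen.add(name)
                stack.append(name)
    return sources


def derivation(column, lineage=COLUMN_LINEAGE):
    """Return the (column, expression) steps used to calculate `column`."""
    steps, seen, stack = [], {column}, [column]
    while stack:
        current = stack.pop()
        definition = lineage.get(current)
        if definition is None:
            continue
        steps.append((current, definition.expression))
        for name in definition.inputs:
            if name not in seen:
                seen.add(name)
                stack.append(name)
    return steps


# "Why do these two 'identical' fields have different values?"
only_in_sales = source_columns("sales_report.revenue") - source_columns(
    "finance_report.revenue"
)
print(only_in_sales)
```

In this made-up example, the two revenue fields look identical by name, but comparing their source columns shows that only the sales report subtracts stage.orders.discount, which is exactly the kind of discrepancy vertical lineage is meant to surface.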
The Value of Automated Data Lineage Tools
As the number of data sources grows, the complexity of horizontal and vertical lineage can increase exponentially. Without an automated system for tracking data lineage, BI professionals can find themselves spending more of their valuable time tracking down problems and less doing actual productive BI work. It’s time for BI professionals to say ‘enough is enough’ and reclaim their critical role in the organization.