Ever reflect on what it would be like to be a piece of data that enters your BI system?
Honey, I’m home! Now I’ll just sit down on my recliner and… hey! Where are you taking me? What? You’re changing my name, but “don’t-worry-I’ll-always-be-the-same”? What does that mean? Okay, well, let me just sit down here and… where are we going now? Why do I have to put on sunglasses and a fake mustache? Fine, but… we have to go to a different place now? Can’t I just relax in peace?
It ain’t easy being data.
Then again, it ain’t easy to be a BI developer trying to track data through a stream of twists, turns, transformations, and multiple BI systems.
Mapping data lineage is just that: following your data of choice throughout your BI environment to see where it went and what happened to it since it crossed the threshold.
Look for the Metadata
In order to perform accurate data lineage mapping, every process in the system that transforms or touches the data must be recorded. This metadata (read: data about your data) is key to tracking your data.
In other words, kind of like Hansel and Gretel in the forest, your data leaves a trail of breadcrumbs – the metadata – to record where it came from and who it really is.
Without your metadata trail, you’ll end up as the wicked witch’s next meal.
So the first step in any data lineage mapping project is to ensure that all of your data transformation processes do in fact accurately record metadata.
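To make the breadcrumb idea concrete, here is a minimal sketch of a transformation pipeline that records metadata every time it touches the data. Everything in it is an assumption made for illustration: the `LineageRecord` shape, the `record_step` helper, the in-memory `LINEAGE_LOG`, and the table and column names are all hypothetical, and a real BI tool will have its own way of storing this.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    """One breadcrumb: which step read which inputs and wrote which output."""
    step: str
    inputs: list[str]
    output: str
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# Hypothetical in-memory log; a real system would persist this somewhere durable.
LINEAGE_LOG: list[LineageRecord] = []


def record_step(step: str, inputs: list[str], output: str) -> LineageRecord:
    """Append a breadcrumb every time a transformation touches the data."""
    record = LineageRecord(step=step, inputs=inputs, output=output)
    LINEAGE_LOG.append(record)
    return record


# Example: one step renames a column, the next aggregates it.
record_step("rename_columns", ["crm.orders.amt"], "staging.orders.amount")
record_step("sum_by_month", ["staging.orders.amount"], "dw.sales.total_revenue")

for rec in LINEAGE_LOG:
    print(f"{rec.step}: {rec.inputs} -> {rec.output}")
```

If even one step skips the `record_step` call, the trail has a gap at exactly that point, which is why this check comes first.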
Let’s Get Mapping
Assuming that all of your BI systems properly keep track of the metadata, it’s time to talk about how to connect all the metadata lineage dots. Depending on the system, you have several options for data lineage techniques:
Pattern-Based Data Lineage
This technique looks for patterns in the metadata across tables, columns, and reports. If Table A and Table B both contain a column with a name that includes the term “Total Revenue” and the data values are very similar, it is likely that this is the same data in two different stages of flow through the system. So pattern-based lineage would link these two columns in a data lineage map.
The advantage of pattern-based lineage is that it doesn’t need to understand the code behind the data transformations. It can work on any system or any technology.
The downside is that it can easily miss more complicated data relationships. If the connection between two datasets was made by a data processing algorithm, it may go over your pattern-based data lineage’s head.
When it comes to data lineage solutions, accuracy is the name of the game. This means that pattern-based lineage is only relevant when you’re dealing with a small number of datasets that have straightforward connections. You can also apply this type of lineage when the database technology you are using is so uncommon that you need a technology-agnostic solution.
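A rough sketch of the pattern-based idea, assuming two simple signals: how alike the column names are and how much the distinct values overlap. The function names, the thresholds, and the sample columns are all made up for illustration; real tools use richer heuristics.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Score in [0, 1] for how alike two column names are, ignoring case and separators."""
    def normalize(s: str) -> str:
        return s.lower().replace("_", " ").replace("-", " ").strip()
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def value_overlap(a: list, b: list) -> float:
    """Jaccard overlap of the distinct values found in two columns."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def likely_same_data(name_a, values_a, name_b, values_b,
                     name_threshold=0.6, value_threshold=0.7) -> bool:
    """Guess that two columns are the same data when both signals are strong.

    The thresholds are arbitrary assumptions, not recommended values.
    """
    return (name_similarity(name_a, name_b) >= name_threshold
            and value_overlap(values_a, values_b) >= value_threshold)


table_a = ("Total Revenue", [100, 250, 400])
table_b = ("total_revenue_usd", [100, 250, 400, 75])
table_c = ("customer_id", [1, 2, 3])

print(likely_same_data(*table_a, *table_b))  # names and values both line up
print(likely_same_data(*table_a, *table_c))  # no shared values, so no link
```

Notice that the code never looks at how Table B was produced. That is both the appeal and the weakness: if revenue were converted to another currency along the way, the values would stop overlapping and the link would be missed.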
Data Lineage by Tagging or Self-Contained Data Lineage
If you have a self-contained data environment that encompasses data storage, processing and metadata management, or that tags data throughout its transformation process, then this data lineage technique is more or less built into your system.
This is a great solution… until you are dealing with more than one system. Since most businesses have at least two systems in a BI environment that spans ingestion, processing, and querying or reporting, any end-to-end data lineage will require a different solution.
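Here is a small sketch of what tagging can look like inside a single system, assuming each value carries its own tag through every transformation. The `Tagged` class, its fields, and the column name are hypothetical and not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """A value bundled with the lineage tag it carries through the system."""
    value: float
    origin: str
    history: tuple[str, ...] = ()

    def apply(self, step_name: str, fn) -> "Tagged":
        """Transform the value and append the step name to the tag."""
        return Tagged(fn(self.value), self.origin, self.history + (step_name,))


raw = Tagged(1999.0, origin="crm.orders.amt_cents")
dollars = raw.apply("cents_to_dollars", lambda v: v / 100)
rounded = dollars.apply("round_2dp", lambda v: round(v, 2))

print(rounded.value, rounded.origin, rounded.history)
```

The tag survives only as long as the data stays wrapped in `Tagged`. Export `rounded.value` to another tool as a bare number and the lineage is gone, which is the multi-system problem in miniature.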
Data Lineage by Parsing
This data lineage technique reads and understands the algorithms used to process, transform and transmit the data. Unlike pattern-based lineage, which can only look at a footprint in one location and a footprint in another location and declare them similar enough to be related, data lineage by parsing can follow the footprints wherever they go, even as they change, because it understands why they are changing.
This technique is far and away the most powerful form of data lineage, facilitating end-to-end data lineage across multiple systems. The caveat, however, is that it needs to understand the programming language and tools in which the transformation and transmission algorithms were written. If you’re lost in Budapest and a kind local gives you precise directions in Hungarian, it doesn’t help if you speak English, French, Chinese, Esperanto, or Pig Latin. If you don’t understand Hungarian, you’re still lost.
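As a toy illustration of parsing, the sketch below reads one narrow shape of SQL statement (`INSERT INTO target (cols) SELECT exprs FROM source`) and works out which source columns feed each target column. It is an assumption-laden example, not a real SQL parser: the regular expressions, function names, and table names are invented here, and it does not handle aliases, joins, subqueries, or string literals.

```python
import re

# Matches only: INSERT INTO <target> (<cols>) SELECT <exprs> FROM <source>
STATEMENT = re.compile(
    r"INSERT\s+INTO\s+(?P<target>[\w.]+)\s*\((?P<cols>[^)]*)\)\s*"
    r"SELECT\s+(?P<exprs>.+?)\s+FROM\s+(?P<source>[\w.]+)",
    re.IGNORECASE | re.DOTALL,
)
# Identifiers that are not immediately followed by "(" (which would be function names).
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*\b(?!\s*\()")


def split_top_level(text: str) -> list[str]:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_lineage(sql: str) -> dict[str, list[str]]:
    """Map each target column to the source columns its expression reads."""
    match = STATEMENT.search(sql)
    if match is None:
        raise ValueError("statement shape not understood by this toy parser")
    target, source = match["target"], match["source"]
    columns = [c.strip() for c in match["cols"].split(",")]
    expressions = split_top_level(match["exprs"])
    if len(columns) != len(expressions):
        raise ValueError("column list and SELECT list have different lengths")
    return {
        f"{target}.{col}": [f"{source}.{name}" for name in IDENTIFIER.findall(expr)]
        for col, expr in zip(columns, expressions)
    }


sql = """
INSERT INTO dw.sales (total_revenue, region)
SELECT ROUND(amount * fx_rate, 2), UPPER(region_code)
FROM staging.orders
"""
for target_col, source_cols in parse_lineage(sql).items():
    print(target_col, "<-", source_cols)
```

Because it reads the transformation itself, this approach can tell that `total_revenue` comes from both `amount` and `fx_rate`, even though neither name nor value looks anything like the result. Hand it a statement in a dialect it does not speak, though, and all it can do is raise `ValueError`: Hungarian directions again.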
Data Lineage Implementation: Automatic or Stick Shift?
A friend of mine once told me she liked driving manual cars with a stick shift because it gave her something to keep her busy while she was driving. If your BI team is bored and looking for things to keep them busy, manual data lineage might be an option.
Most BI and data professionals, however, are looking to take busy work off their plates so that they have more room to dig into the truly intelligent parts of business intelligence. If that’s you, automated data lineage is the way to go.
By implementing automated data lineage tools, your team will have a visual map of the path your data takes from source to destination, no matter where it goes and what happens to it, in a matter of seconds.
Impact analysis, root cause analysis, migration preparation, and regulatory compliance become far simpler. It’s like being lost in Budapest and a kind local hands you a precise map, written in whichever language you speak, with a big YOU ARE HERE notation.
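Once lineage exists as a map, impact analysis and root cause analysis reduce to walking the map in opposite directions. The sketch below assumes the lineage has already been collected into a simple dictionary of edges; the node names and helper functions are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the nodes listed in its value.
EDGES = {
    "crm.orders.amt": ["staging.orders.amount"],
    "staging.orders.amount": ["dw.sales.total_revenue"],
    "erp.fx.rate": ["dw.sales.total_revenue"],
    "dw.sales.total_revenue": ["report.quarterly_revenue", "report.exec_dashboard"],
}


def reachable(start: str, edges: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(edges: dict[str, list[str]]) -> dict[str, list[str]]:
    """Flip every edge so the walk goes from destination back to source."""
    flipped: dict[str, list[str]] = {}
    for src, targets in edges.items():
        for tgt in targets:
            flipped.setdefault(tgt, []).append(src)
    return flipped


# Impact analysis: what breaks downstream if this column changes?
impact = reachable("staging.orders.amount", EDGES)
print(sorted(impact))

# Root cause analysis: where could a bad number in this report have come from?
root_causes = reachable("report.exec_dashboard", reverse(EDGES))
print(sorted(root_causes))
```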
Automated data lineage mapping enables you to track any data with precision, regardless of the kind of transformations it undergoes.
Even if it changes its name and puts on sunglasses and a fake mustache.