Data lake lineage is the ability to track the flow of data into a data lake from its source, and then out of the data lake to its target, including the details of any processing it may have undergone on its journey.
Why is data lineage in a data lake trickier (and more critical!) than in other environments?
The great strength of data lakes lies in their ability to store structured and unstructured data of multiple types and in multiple formats. Moreover, a data lake may contain the same data in several different forms: a raw, unprocessed copy; a copy produced by a specific transformation; and a copy refined for use in an analytics or machine learning tool.
While the flexibility of data lakes is their primary strength, it can turn into their primary weakness if you don’t have a way of keeping track of what’s going on. Data lineage is, at heart, the ability to track what’s going on, what has gone on, and what will go on with any data point or asset.
The extensibility of the tools used to run jobs in data lakes can defeat simpler data lineage mapping techniques, so a comprehensive data lake lineage solution must be able to follow and track metadata across many different processing types and tools.
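To make this concrete, here is a minimal sketch of what lineage tracking across those multiple copies of the same data might look like. The asset names, formats, and transformation labels are illustrative assumptions, not part of any particular tool:

```python
from dataclasses import dataclass, field

@dataclass
class LineageNode:
    """One data asset (raw, transformed, or refined) in the lake."""
    name: str
    fmt: str                                     # e.g. "csv", "parquet"
    parents: list = field(default_factory=list)  # upstream assets
    transform: str = ""                          # processing step that produced it

def upstream_chain(node: LineageNode) -> list:
    """Walk back from an asset to its original sources."""
    chain = [node.name]
    for parent in node.parents:
        chain.extend(upstream_chain(parent))
    return chain

# The same data in three forms, as described above (hypothetical names)
raw = LineageNode("sales_raw", "csv")
cleaned = LineageNode("sales_cleaned", "parquet", [raw], "deduplicate + cast types")
features = LineageNode("sales_features", "parquet", [cleaned], "aggregate for ML")

print(upstream_chain(features))
# ['sales_features', 'sales_cleaned', 'sales_raw']
```

Even this toy graph shows the core requirement: every processed copy must keep a link back to its parents, or the connection between the raw and refined versions of the same data is lost.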
What tools are most helpful for data lake lineage?
As mentioned above, one of the most critical tools is an automated data lineage solution that is designed to manage the whirl of metadata generated by the movement of data assets in and out of the data lake.
Another helpful tool is an automated data catalog, where each data asset has its own entry with definitions, usage statistics and ratings, integrated data lineage, and a collaboration section where users can discuss the asset with responsible parties and subject matter experts. Such a catalog makes it easy to distinguish between data assets that might otherwise get entangled or confused, and it is an important part of maintaining order in a data lake environment.
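A catalog entry of the kind described above could be sketched as follows. The field names and the comment mechanism are illustrative assumptions about what such an entry contains, not the schema of any specific catalog product:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One data asset's entry in a data catalog (illustrative fields)."""
    asset_name: str
    definition: str
    usage_count: int = 0
    rating: float = 0.0
    lineage: list = field(default_factory=list)     # upstream asset names
    discussion: list = field(default_factory=list)  # (author, text) pairs

    def comment(self, author: str, text: str) -> None:
        """Collaboration section: users discuss the asset with its owners."""
        self.discussion.append((author, text))

entry = CatalogEntry(
    asset_name="sales_cleaned",
    definition="Deduplicated daily sales, one row per order",
    lineage=["sales_raw"],
)
entry.comment("analyst_a", "Does this include refunds?")
entry.usage_count += 1
```

The point of the integrated `lineage` field is that a user reading the entry can immediately see which upstream asset a dataset was derived from, rather than guessing between similarly named copies in the lake.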