What is Databricks Data Lineage?
Databricks Data Lineage is a built-in feature of Unity Catalog that lets you see how data flows between the workloads and assets contained within the Databricks platform.
How does Databricks Data Lineage work?
Databricks Data Lineage builds on two parts of the platform: the Lakehouse Platform, which provides multi-cloud data storage and fast, reliable processing via the Delta Lake format, and Unity Catalog, which provides unified governance for all the data assets contained within or connected to the lakehouse.
Because this comprehensive, unified platform can understand what a query says about how data should move from one place to another, it can draw a map of data flow based on those queries: each time a query reads from one asset and writes to another, Unity Catalog records that relationship automatically.
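For instance, a single CREATE TABLE ... AS SELECT statement names both its source and its target, which is all Unity Catalog needs to link the two tables. A minimal sketch, with hypothetical catalog, schema, and table names:

```python
# Hypothetical example, run on a Unity Catalog-enabled cluster or SQL warehouse
# (in Databricks, the `spark` SparkSession is provided automatically).
# Because the statement names both the source and the target table,
# Unity Catalog can record table- and column-level lineage between them.
spark.sql("""
    CREATE OR REPLACE TABLE main.analytics.sales_summary AS
    SELECT region, SUM(amount) AS total_amount
    FROM main.raw.raw_sales
    GROUP BY region
""")
```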
How do you create a lineage for Databricks?
To take advantage of Databricks’ lineage capabilities, you must have:
- Unity Catalog (available in Databricks’ Premium tier) set up, with the relevant tables/data assets registered in its metastore
- Queries that use the Spark DataFrame or Databricks SQL interfaces (see the DataFrame sketch after this list)
- Permission to access the data asset you want to see lineage for (for tables, you need the SELECT permission)
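Here is a minimal sketch of a query that meets these requirements, written against the Spark DataFrame interface. The three-level table names (catalog.schema.table) are hypothetical, and `spark` is the SparkSession that Databricks notebooks and jobs provide automatically:

```python
# Hypothetical example: reading from and writing to Unity Catalog tables
# with the Spark DataFrame API. Table names are illustrative only.
orders = spark.table("main.raw.orders")

daily_revenue = (
    orders
    .groupBy("order_date")
    .sum("revenue")
    .withColumnRenamed("sum(revenue)", "total_revenue")
)

# Saving the result as a Unity Catalog table is what lets Databricks link
# main.raw.orders -> main.analytics.daily_revenue in the lineage graph.
daily_revenue.write.mode("overwrite").saveAsTable("main.analytics.daily_revenue")
```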
How can Databricks Data Lineage help your data management?
The improved visibility of your data flow provided by Databricks Data Lineage empowers your data teams to better manage your data through:
- Efficient root cause analysis for any errors discovered in your data pipelines
- Easy tracking of data trails subject to regulatory compliance
- Facilitation of quick and accurate impact analysis before you make any changes within your Databricks lakehouse or pipelines (sketched below)
Because Databricks Data Lineage is automated and captured in real time, lineage information is available immediately upon request, without waiting or time-consuming manual tracing of data flow.
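As a sketch of what impact analysis can look like in practice, the query below lists the tables that read from a given source table before you change its schema. It assumes your workspace exposes the Unity Catalog lineage system table `system.access.table_lineage` and uses hypothetical table names:

```python
# Hedged sketch: assumes the lineage system table system.access.table_lineage
# is enabled in your workspace. Lists the distinct downstream tables that
# read from main.raw.orders, so you can gauge the impact of changing it.
downstream = spark.sql("""
    SELECT DISTINCT target_table_full_name
    FROM system.access.table_lineage
    WHERE source_table_full_name = 'main.raw.orders'
      AND target_table_full_name IS NOT NULL
""")
downstream.show(truncate=False)
```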
Limitations of Databricks Data Lineage
The scope of Databricks Data Lineage is restricted to the data assets within Databricks or directly connected to the Unity Catalog’s metastore. Since Databricks is usually only one piece of any modern data stack, end-to-end data catalog lineage of data flow from source to target for an enterprise’s entire data landscape cannot be created with just Databricks Data Lineage. In order to have a holistic picture of data flow through the entire data stack, an organization needs an end-to-end automated data lineage solution that will incorporate Databricks Data Lineage into a larger picture.
There are also cases where Databricks Data Lineage will not capture or display lineage even for internal assets, for example:
- If a table is renamed
- When a job is run using the Jobs API runs submit request
- For Delta Live Tables pipelines
In addition, lineage collected by Databricks Data Lineage is only available for 30 days after its collection, which can be a problem if a pipeline issue is identified long after it occurred.
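One way to keep lineage beyond that window, assuming the same `system.access.table_lineage` system table as above and a hypothetical archive table of your own, is to snapshot the records you care about on a schedule, for example as a Databricks job:

```python
# Hedged sketch: archive lineage records into a table you control so they
# outlive the retention window. Assumes system.access.table_lineage is
# available; main.ops.lineage_archive is a hypothetical table name.

# Create an empty archive table with the same schema on the first run.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.ops.lineage_archive AS
    SELECT * FROM system.access.table_lineage WHERE 1 = 0
""")

# Append only the records newer than what has already been archived.
spark.sql("""
    INSERT INTO main.ops.lineage_archive
    SELECT * FROM system.access.table_lineage
    WHERE event_time > (
        SELECT COALESCE(MAX(event_time), TIMESTAMP '1970-01-01 00:00:00')
        FROM main.ops.lineage_archive
    )
""")
```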
Want a demo of how an end-to-end, holistic picture of data flow through Databricks and your entire data stack would look? Request one here.