What is Data Lineage?

In the world today, data has become the norm. It’s truly everywhere, but are we able to fully comprehend what our data is telling us? Are we even seeing the whole picture?


Probably not. Unless you’re using a tool that provides automated data lineage, the story you’re able to understand is, well, let’s just say incomplete. Why is that? Keep reading to find out.



Data Lineage Definition

We can define data lineage as the data’s life cycle or the full data journey. This life cycle includes: where the data originates, how it has gotten from point A to point B, and where it exists today. 


By utilizing data lineage, organizations can better understand what happens to data as it travels through different pipelines (ETL, files, reports, databases, etc.). During its journey, data interacts with other pieces of information, is transformed, and is used in various reports. Seeing that journey allows businesses to make more informed decisions. It also enables companies to trace the sources of specific business data in order to track down errors, implement process changes, and streamline system migrations. This saves organizations significant time and resources, improving efficiency and shortening time to insight.


Without understanding their data lineage, companies can’t predict the impact that a change might have on reports or ETL processes elsewhere in the data environment. In effect, they are working in an uncontrolled environment: they can’t fully understand where their data came from or what happened to it along the way, and they can’t extract its full value.
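One way to make this concrete is to treat lineage as a directed graph whose edges point from a source to whatever is derived from it. The sketch below is illustrative only: the asset names and the `trace_origins` helper are hypothetical and do not describe how any particular tool works.

```python
from collections import defaultdict

# Hypothetical lineage edges: (source, target) means "target is derived from source".
EDGES = [
    ("crm.orders", "etl.load_sales"),
    ("erp.invoices", "etl.load_sales"),
    ("etl.load_sales", "dwh.fact_sales"),
    ("dwh.fact_sales", "report.monthly_revenue"),
]

def build_upstream_index(edges):
    """Map each asset to the set of assets it is directly derived from."""
    upstream = defaultdict(set)
    for source, target in edges:
        upstream[target].add(source)
    return upstream

def trace_origins(asset, edges):
    """Return the assets with no upstream parents that ultimately feed `asset`."""
    upstream = build_upstream_index(edges)
    origins, seen, stack = set(), set(), [asset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = upstream.get(node, set())
        if not parents and node != asset:
            origins.add(node)  # nothing feeds this node, so it is an origin
        stack.extend(parents)
    return origins

print(sorted(trace_origins("report.monthly_revenue", EDGES)))
```

Walking the same edges in the opposite direction would answer the other half of the question: where a given source ends up.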


How Is It Generated?

Now that we have established just how important it is to understand the origin and flow of a company’s data, you may be wondering how organizations can obtain this information. 


Well, that’s the problem – data teams today tend to have to manually map out lineage since they are usually dealing with multi-vendor environments. For example, if they’ve got Informatica (ETL), Oracle (DWH) and Tableau (reporting), each has its own metadata labeling system managed by different teams. Trying to figure out where a specific data element in a Tableau report came from can be impossible, and if not completely impossible, you can bet it will take data analysts a really LONG time to figure out. 


Instead of manually compiling all of the data sources, companies can utilize an automated data lineage tool. This allows data teams to receive relevant information from all of their different data sources, instantly.


Impact Analysis in Octopai. Data lineage from ETL through DB and Analysis Services to Reporting.


What Does Data Lineage Look Like?

Data lineage is primarily a visualization of the journey of different data points: a visual representation of the overall flow of data. It shows how data is manipulated in the ETL process, which allows organizations to assess the quality of their data before it is loaded into an analytics tool.


If you are still unable to picture this, refer to the image at the top of this post. It shows how Octopai’s automated data intelligence platform compares and presents the lineage of two different reports (either from the same system or different systems – in this case it’s SSRS and Power BI). 


This clearly illustrates any variations between the reports and enables users to quickly understand exactly how two or more reports ended up showing differing results. Specifically, we see that an additional ETL process and table appear in the lineage of the report at the bottom of the picture but are missing from the report at the top. This is the point where the two reports began to diverge.
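The comparison idea can be sketched as a set difference over each report’s upstream assets. This is a simplified illustration, not a description of Octopai’s implementation; the report names, asset names and the `lineage_diff` helper are all hypothetical.

```python
# Hypothetical upstream lineage for two reports, as sets of asset names.
LINEAGE = {
    "ssrs.sales_report": {"src.orders", "etl.load_orders", "dwh.fact_orders"},
    "powerbi.sales_report": {
        "src.orders", "etl.load_orders", "dwh.fact_orders",
        "etl.apply_discounts", "dwh.fact_orders_adjusted",
    },
}

def lineage_diff(report_a, report_b, lineage):
    """Return the assets unique to each report and the assets they share."""
    a, b = lineage[report_a], lineage[report_b]
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "shared": sorted(a & b),
    }

diff = lineage_diff("ssrs.sales_report", "powerbi.sales_report", LINEAGE)
print(diff["only_in_b"])  # the extra ETL process and table in the second report
```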


Data lineage also exists at two different levels – horizontal and vertical. The higher-level view is horizontal because it shows the big picture of how data flows between systems. To drill in deeper, data teams must look at the vertical view. They can sift through layers of data until they reach the column-to-column (or column-to-report) level. Vertical data lineage is helpful for such things as solving report discrepancies and getting a more comprehensive understanding of what exists in an environment that is going to be migrated to a new system. 
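The two levels can be thought of as two views over the same set of edges. In the rough sketch below, the table and column names are made up, and the `system.table.column` naming convention is an assumption chosen to keep the example short.

```python
# Hypothetical column-level lineage edges, written as system.table.column.
COLUMN_EDGES = [
    ("oracle.orders.amount", "informatica.load_sales.amount"),
    ("oracle.orders.currency", "informatica.load_sales.amount"),
    ("informatica.load_sales.amount", "dwh.fact_sales.revenue"),
    ("dwh.fact_sales.revenue", "tableau.revenue_dashboard.total_revenue"),
]

def system_of(column_path):
    """Take the first dotted component as the system name."""
    return column_path.split(".", 1)[0]

def horizontal_view(column_edges):
    """Collapse column-level edges into distinct system-to-system edges."""
    systems = set()
    for source, target in column_edges:
        pair = (system_of(source), system_of(target))
        if pair[0] != pair[1]:
            systems.add(pair)
    return sorted(systems)

def vertical_view(column_edges, target_column):
    """Drill down: return the columns that directly feed one target column."""
    return sorted(src for src, tgt in column_edges if tgt == target_column)

print(horizontal_view(COLUMN_EDGES))
print(vertical_view(COLUMN_EDGES, "informatica.load_sales.amount"))
```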



How Metadata Fits Into The Process

Whereas data lineage is the visual representation of how the data flows throughout various systems, the actual data presented in the lineage must first be located and verified. This is done through metadata management.


Metadata is essentially the information about a company’s different assets and the relationships between them. With metadata, we are able to find all data items related to a specific report or ETL process, see all the dependencies related to it, and then trace its entire lifecycle.


In short, metadata is to data lineage what wheels are to a car. Metadata is what makes it possible to have data lineage, and the demand for tools to effectively manage metadata is growing rapidly.


Root Cause Analysis in Octopai. End-to-end data lineage from source systems to report through the entire data landscape.

Introducing Data Lineage XD – Multilayered Data Lineage

Octopai’s Data Lineage XD goes beyond the standard one-dimensional data lineage. Instead, it offers three different layers of data lineage: cross-system lineage, end-to-end column lineage, and inner-system lineage. 


Cross-System Lineage:

Cross-system lineage provides end-to-end lineage at the system level from the entry point into the data landscape, all the way to reporting and analytics. This type of lineage provides high-level visibility into the data flow – mapping where data is coming from and where it is going.


Top uses for cross-system lineage include:

  • Predicting the impact of a process change
  • Analyzing the impact of a broken process
  • Discovering parallel processes performing the same tasks
  • High-level visualization of data flow
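As an illustration of the third use, one simple heuristic for spotting parallel processes is to group processes that read exactly the same inputs and write exactly the same outputs. The process names and the `find_parallel_processes` helper below are hypothetical, and matching inputs and outputs only suggests duplication; it does not prove the logic is identical.

```python
from collections import defaultdict

# Hypothetical process metadata: each process reads some inputs and writes some outputs.
PROCESSES = {
    "etl.load_customers_v1": {"inputs": {"crm.customers"}, "outputs": {"dwh.dim_customer"}},
    "etl.load_customers_v2": {"inputs": {"crm.customers"}, "outputs": {"dwh.dim_customer"}},
    "etl.load_orders": {"inputs": {"crm.orders"}, "outputs": {"dwh.fact_orders"}},
}

def find_parallel_processes(processes):
    """Group processes that share exactly the same inputs and outputs."""
    groups = defaultdict(list)
    for name, io in processes.items():
        signature = (frozenset(io["inputs"]), frozenset(io["outputs"]))
        groups[signature].append(name)
    # Only groups with more than one member are candidates for duplication.
    return [sorted(names) for names in groups.values() if len(names) > 1]

print(find_parallel_processes(PROCESSES))
```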


End-to-End Column Lineage:

Cross-system lineage is perfect for the big picture. But what about when you need to zoom in and see the details? The end-to-end column lineage details column-to-column-level lineage between systems from the entry point into the data landscape, all the way through to reporting and analytics. 


Top uses for end-to-end column lineage include:

  • Impact analysis of a change to a column in the source system
  • Root cause analysis to uncover the source of reporting errors
  • Column-level visualization of data flow
  • Regulatory compliance audit preparation


Inner-System Lineage:

Sometimes you need to dive even deeper into the nitty-gritty details of one particular system. The inner-system lineage details the column-level lineage within an ETL process, report, or database object. Understanding the logic and data flow for each column provides visibility at the column level no matter how complex the process, report, or object. 


Top uses for inner-system lineage include:

  • Visualizing the logic of a report, ETL, or database object data flow
  • Locating dependencies within a report
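To illustrate what column-level lineage inside a single process might involve, the sketch below resolves a derived column down to the input columns it ultimately depends on. The column names and the `base_columns` helper are hypothetical, and the mappings are assumed to contain no cycles.

```python
# Hypothetical column mappings inside a single ETL process.
# Each derived column lists the columns its expression reads from.
MAPPINGS = {
    "net_amount": ["gross_amount", "discount"],
    "tax": ["net_amount", "tax_rate"],
    "total": ["net_amount", "tax"],
}

def base_columns(column, mappings):
    """Resolve a derived column to the input columns it ultimately depends on.

    Assumes the mappings are acyclic; a cycle would recurse forever.
    """
    if column not in mappings:
        return {column}  # not derived inside this process, so it is an input
    resolved = set()
    for parent in mappings[column]:
        resolved |= base_columns(parent, mappings)
    return resolved

print(sorted(base_columns("total", MAPPINGS)))
```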


If you’re looking for the most in-depth, complete data lineage tool, look no further than Data Lineage XD.

Some Data Lineage Use Cases

Discovering Root Cause of Reporting Errors. If the sales team is claiming a deal flow that simply doesn’t align with the Finance Department’s figures, you can be sure that the data team manager will be asked to get involved. The data team has to find out why the sales numbers are different from the finance numbers. With automated data lineage, they can visualize the entire data flow and work through root cause and impact analysis within a few minutes, so they no longer need to fear having to prove the accuracy of the data in their reports. They can use the lineage to pinpoint the data in question and explain where it came from and what modifications it went through. Whether an error exists or not, data professionals can feel confident in their explanation, and business users can be assured that the data behind their reports is understood and that its accuracy can be demonstrated.


System Migration & Upgrades. Migrating from a legacy data tool to a modern one, or upgrading to a new version of a system, becomes significantly easier when advanced data lineage gives data teams full visibility into their data environment. With automated lineage capabilities, teams can see which reports or ETL processes are duplicates and which rely on data sources that are obsolete, questionable or non-existent. That lets them reduce the number of data items that must be migrated (no need to migrate duplicates or obsolete reports, right?). Lineage visualization not only reduces time, effort and error in this process but also enables faster execution of the migration project.


According to Forbes, lineage analysis helps identify “islands of data” that are not currently in use. This allows companies to understand the data they are actually utilizing and stop wasting money, time and effort on irrelevant stored information.
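The migration triage described above can be sketched roughly as follows. The report inventory, the list of obsolete sources and the `plan_migration` helper are hypothetical, and treating two reports with identical sources as duplicates is a simplifying assumption that a real review would need to confirm.

```python
# Hypothetical inventory: each report and the sources it reads.
REPORTS = {
    "sales_by_region": {"dwh.fact_sales", "dwh.dim_region"},
    "sales_by_region_v2": {"dwh.fact_sales", "dwh.dim_region"},
    "legacy_inventory": {"old_erp.stock"},
    "customer_churn": {"dwh.dim_customer"},
}
OBSOLETE_SOURCES = {"old_erp.stock"}

def plan_migration(reports, obsolete_sources):
    """Split reports into migrate / duplicate / obsolete buckets."""
    plan = {"migrate": [], "duplicate": [], "obsolete": []}
    seen_signatures = set()
    for name in sorted(reports):
        sources = frozenset(reports[name])
        if sources & obsolete_sources:
            plan["obsolete"].append(name)   # depends on a retired source
        elif sources in seen_signatures:
            plan["duplicate"].append(name)  # same sources as an earlier report
        else:
            seen_signatures.add(sources)
            plan["migrate"].append(name)
    return plan

print(plan_migration(REPORTS, OBSOLETE_SOURCES))
```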


Data Privacy Regulations. When it comes to compliance, whether it’s GDPR, the California Privacy Rights Act (CPRA), or any of the numerous personal data protection acts on the horizon, you need a better grasp of your data. To get it, you need a data lineage tool in place. Protecting personal information starts with knowing where each piece of your data originated. To stay compliant, the data team can use data lineage to identify a data element as personal information (PI), flag it, and track all data items related to it. With this capability, companies can stay organized, transparent, and compliant.


Impact Analysis. Before implementing a change, companies must understand what reports, data elements, or users will be affected. Through the use of automated data lineage, data teams can identify the data objects downstream and see the potential impact. They can also pinpoint which business users interact with this data and how they will be impacted. By recognizing who and what will be influenced by this change, they can then decide if they should follow through with the modification.
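A minimal sketch of this kind of impact analysis, assuming lineage is available as a mapping from each asset to its direct consumers and that usage metadata links reports to users. All names here, including `impact_of_change`, are hypothetical.

```python
# Hypothetical lineage: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "dwh.fact_sales": ["etl.aggregate_sales", "report.daily_sales"],
    "etl.aggregate_sales": ["report.quarterly_summary"],
}
# Hypothetical usage metadata: which business users open which reports.
REPORT_USERS = {
    "report.daily_sales": {"alice", "bob"},
    "report.quarterly_summary": {"carol"},
    "report.hr_headcount": {"dave"},
}

def impact_of_change(asset, downstream, report_users):
    """List every object downstream of `asset` and the users who rely on them."""
    affected, stack = set(), [asset]
    while stack:
        for child in downstream.get(stack.pop(), []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    users = set()
    for obj in affected:
        users |= report_users.get(obj, set())
    return sorted(affected), sorted(users)

objects, users = impact_of_change("dwh.fact_sales", DOWNSTREAM, REPORT_USERS)
print(objects)
print(users)
```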


You must also have a clear understanding of any change the data encountered along the way. Knowing the entire story behind each and every data item is a clear case of ‘knowledge is power.’  The more information an organization has regarding its data, the better prepared it will be for the future.  


Top Benefits of Data Lineage

Data lineage is a powerful tool in helping organizations excel and stand apart from the competition. The importance is seen in its impact on both an organization’s function and its reputation.


Build trust in your data

The mandate to “become a data-driven organization” is almost cliché; if you have data, what else would you want driving your organization? Tarot cards? Random guesses? 


But it’s not enough to have the data. You need to have trustworthy data (otherwise it’s not much better than the “data” found on tarot cards). Transparency is critical to trust, and transparency is exactly what data lineage gives you. Where did that number come from? What happened to it along the way? What other data did it affect? With transparent answers at the ready, your organization’s data users will be able to rely on your data and use it to drive the business forward.


Faster incident resolution

Mistakes happen to the best of us. The number on the report was wrong; the figure in the account was off; there was an error in the result of the calculation. But it’s how you handle the mistake that sets you apart. Can you track down the source of the mistake and correct it within minutes or hours of discovering it? Or does it take days of work, all while a client or management is waiting impatiently for the resolution?


When it comes to figuring out “what went wrong where,” data lineage is your biggest asset. A comprehensive automated data lineage solution gives you the ability to track errors to their source and deliver answers and corrections in record time. 


Meet compliance standards with ease

When new regulatory compliance standards are issued or updated, is your team overjoyed at the enhanced protection available for the public – or frustrated at the enhanced headache this is going to cause for them? More standards mean more data management, more data tracking, more data masking, more audit preparation… more “stuff” to stuff into an already burgeoning workload.


One of the benefits of data lineage is the ability to easily find, track and manage data that is subject to regulation. When the flow of data can be followed clearly from source to target, identifying any one point of regulated data (e.g. PII) on that flow makes it simple to identify every other data point derived from it. Additionally, with the ability to show what happens to data as it moves through your systems, audit preparation and proving compliance become far easier.
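The propagation idea can be sketched as a breadth-first walk over column-level lineage edges, starting from the columns already tagged as regulated. The column names and the `propagate_tag` helper are hypothetical, and a real classification would likely still need human review.

```python
from collections import defaultdict, deque

# Hypothetical column-level edges: (source, target) means target is derived from source.
EDGES = [
    ("crm.customers.email", "staging.customers.email"),
    ("staging.customers.email", "dwh.dim_customer.email_hash"),
    ("crm.customers.signup_date", "dwh.dim_customer.signup_date"),
    ("dwh.dim_customer.email_hash", "report.marketing_list.contact_key"),
]

def propagate_tag(tagged, edges):
    """Return the tagged columns plus every column derived from them."""
    downstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)
    flagged, queue = set(tagged), deque(tagged)
    while queue:
        node = queue.popleft()
        for child in downstream[node]:
            if child not in flagged:
                flagged.add(child)
                queue.append(child)
    return flagged

print(sorted(propagate_tag({"crm.customers.email"}, EDGES)))
```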


Build trust in your organization

When you can make accurate, data-based decisions, address data issues in a transparent and timely fashion and have a stellar compliance record, your organization’s reputation will precede it. 


People do business with organizations they trust. One of the primary benefits of data lineage is in establishing that trustworthy reputation and becoming a go-to in your industry. 


The Importance of Data Lineage in Data Governance

Data governance is the rules and roles that control how data is managed within your organization. The goal of these rules and roles is to ensure that your data can be used properly (e.g. security, access, compliance) and effectively (e.g. quality, operability).


We sometimes hear people asking, “What is data lineage vs. data governance?” but the more accurate question is “What is data lineage in data governance?” Data lineage is a tool that enables the effective implementation of data governance in almost every area:


Quality

To ensure high data quality, you need to be able to assess the original source from which the information entered your data landscape, the quality level of that original source, what transformations the data underwent on its subsequent journey through your systems and any potential problems caused by transformations or data interactions. Data lineage is precisely that: a tool that offers transparency to every point and aspect of your data’s journey.


Security and Access

The only thing worse than people not being able to access the information they should have is people being able to access the information they shouldn’t have. In order to prevent both of these access issues when you’re dealing with large quantities of data, you need an automated method to classify your data and identify the appropriate access level – or at least to do the initial work, even if a human needs to check and approve the classification. The ability of data lineage to trace a data asset’s journey throughout your entire data landscape makes identification of personal, secure or otherwise sensitive data much easier. As long as one point in the journey is identified as sensitive data, all other stops can be marked as sensitive and the appropriate access permissions granted or policies applied.


Compliance

No matter what industry you are in, from finance to healthcare to insurance, you most likely have regulations you need to comply with. A large part of data governance is setting policies to ensure compliance – and then proving you did, in fact, comply. Data lineage gives you reliable data to use in your calculations and in your reporting. Almost as important, it gives you the ability to effortlessly prove accuracy to any auditor.


How is Data Lineage Used in Different Industries?

While some uses of data lineage are common to almost anyone dealing with significant amounts of data, others are particular to an industry’s unique needs. Here are some real-world examples of industry-specific data lineage uses: 


Data lineage in Healthcare

Healthcare data needs to be easily accessible to the providers who need it while staying completely inaccessible to anyone unauthorized (read: HIPAA, et al.). The tracing capacity of data lineage is key to the seamless protection of personal healthcare information. Identify one point of PII in your data systems, and a data lineage tool will be able to follow its trail through every system it touches, making it possible to automate access rules and masking protocols.


In healthcare, data often supports life-or-death decision-making processes, making data accuracy crucial. Additionally, in the unfortunate event that decisions don’t lead to their hoped-for outcomes, being able to trace the decision process back and determine where the error lay is indispensable to an organization’s future success (as well as its prevention of legal issues).


Data lineage in Banking

Over the past 15 years, new regulations for the banking industry have come fast and furious. BCBS 239 and related Basel Committee frameworks such as Basel III and IV and FRTB aim to standardize the way that financial institutions evaluate, prepare for and report on all kinds of risk, including credit risk, operational risk and market risk.


Regulator scrutiny is higher when it comes to banks’ risk evaluations and reports – and much, much higher when banks want to use their own internal models (when that’s even allowed). If financial organizations don’t want to spend inordinate human resources confirming and proving their data’s accuracy, they’ll need to implement a solution like automated data lineage to help. 


Data lineage in Insurance

Like the banking industry, the insurance industry has seen increasing regulations over the past decade and a half. IFRS 17 and Solvency II, among others, raise the bar by which insurance companies evaluate the worth of their insurance contracts and estimate how much capital they need in reserve. All these calculations must now be strictly data-based, some of them requiring up to 10 years of historical data. Demonstrating that their data is accurate and that their models can be accepted by regulatory compliance officers is where data lineage can save inordinate amounts of time and effort for insurance companies.


In a healthcare insurance-specific use, data lineage is the key to pinpointing changes in pre-adjudicated and post-adjudicated claims. The procedure codes can change during the claim adjudication process – either automatically or by a manual processor – so for a claims analyst who needs the complete picture, data lineage is the only thing that will show exactly what happened in a procedure code field over the history of the claim.
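As a rough illustration of field-level history, the sketch below replays a change log to reconstruct every value a field held over the life of a claim. The record layout, the claim ID, the codes and the `field_history` helper are all made up for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldChange:
    claim_id: str
    field: str
    old_value: str
    new_value: str
    changed_by: str  # e.g. "auto-adjudication" or a processor ID
    stage: str       # e.g. "pre-adjudication" or "post-adjudication"

# Hypothetical change log for one claim, in chronological order.
CHANGES = [
    FieldChange("C-1001", "procedure_code", "A100", "A200", "auto-adjudication", "pre-adjudication"),
    FieldChange("C-1001", "billed_amount", "250.00", "230.00", "processor-17", "post-adjudication"),
    FieldChange("C-1001", "procedure_code", "A200", "A250", "processor-17", "post-adjudication"),
]

def field_history(claim_id, field, changes):
    """Return the ordered sequence of values a field held over the claim's life."""
    relevant = [c for c in changes if c.claim_id == claim_id and c.field == field]
    if not relevant:
        return []
    return [relevant[0].old_value] + [c.new_value for c in relevant]

print(field_history("C-1001", "procedure_code", CHANGES))
```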


Updated December 2022

