Video Transcript
Shannon Kempe: Hello and welcome. My name is Shannon Kempe and I’m the Chief Digital Manager of DATAVERSITY. We’d like to thank you for joining this DATAVERSITY webinar, How to Accelerate BI Responsiveness with Data Lineage, sponsored today by Octopai. Just a couple of points to get us started: due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom middle of your screen, or if you’d like to tweet, we encourage you to share highlights or questions via Twitter using #DATAVERSITY.
If you’d like to chat with us or with each other, we certainly encourage you to do so. Again, to access and open the Q&A or the chat panel, you will find the icons for those features in the bottom middle of your screen. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and any additional information requested throughout the webinar. Now, let me introduce you to our speakers for today, David Loshin and David Bitton.
David Loshin, President of Knowledge Integrity, is globally recognized as an expert in business intelligence, data quality, and master data management, frequently contributing to Intelligent Enterprise, DM Review, and The Data Administration Newsletter, tdan.com. David Bitton has extensive product knowledge, coupled with creative ideas for product application, and a solid history of global sales and marketing management of software-as-a-service and internet-driven products. With that, I will give the floor to David Loshin to start today’s webinar. Hello and welcome.
David Loshin: Thanks, Shannon. Thanks to DATAVERSITY and to Octopai for inviting me to speak on this topic. It’s a topic that I’ve been familiar with for a long time, although what we will see as I walk through some of the slides is that perceptions of the value and the utility of Data Lineage have truly accelerated over the last couple of years, and that is contributing great value to accelerating BI responsiveness.
To start off, I think it’s really good to begin a talk about business intelligence responsiveness by reflecting on a historical perspective on the use of information for decision-making. An interesting side note: recently, I’ve been doing a lot of reading about information value and decision-making.
Going back over decades of research papers, magazine articles, et cetera, about information value, many of them, if not all of them, reflect what I’ve got here, this quote from an article called The Value of Information that was written in 1968, which says, “Ideally, information is the sum total of the data needed for decision-making.”
When I read through a lot of these papers, there are some common themes that appear over that five-decade-plus span about data usability within and across the enterprise for the purposes of making good decisions. These are what I show on the right side of the slide. Information awareness, or knowing what data sets exist and could be of value for decision-making and business intelligence. Information availability, which is basically knowing whether the data sets are available for use and under what circumstances, or what restrictions are in place for using that information. Information trust, or the degree to which the data in those data assets is trustworthy.
Accessibility, or how that information is actually made available and how I can get access to it. Information currency and information freshness, which both look at how frequently data is refreshed and whether it’s kept up to date. Then, the last one, information format, or whether the data is in a format that can be easily used. I think that anybody who’s worked on any data warehouse, data mart, data visualization, reporting, or analytics application is intimately familiar with some of the issues that arise from any one of these items.
Organizations that have traditionally used a data warehouse have attempted to finesse these issues by creating a set of well-defined data extraction and data preparation pipelines. As we will see, this tightly structured, traditional data warehouse architecture is gradually disintegrating. I think what we’ll discuss is how the traditional or conventional approach to populating data into a data warehouse is beginning to no longer be the status quo.
If you look at this picture, in a conventional data warehouse, there are going to be well-defined data processing pipelines. Data is being extracted from existing operational or transaction processing systems that are typically known sources well within the corporate data center. The data is extracted usually in managed batches.
Those extracted datasets are then typically moved to some staging area, sometimes an operational data store, where the data is processed: it’s standardized, it’s cleansed, it’s transformed, it’s reorganized. It’s then prepared for loading, typically all in batch, and forwarded to the target data warehouse. That data warehouse is a single resource that’s then shared by the different downstream consumers. For the most part, all these processes are managed by an IT team, or a small team that is the data warehouse guardians, you might say.
There are these well-defined processing pipelines that are transforming the data from the original formats to one usable format for reporting and business intelligence and there is some control that is exerted over those data production processes. There’s some level of oversight and control. To that extent, the fact that we’ve limited ourselves to a smaller domain of sources that can be fed into a BI environment gives us some level of control.
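To make the conventional pipeline described above concrete, here is a minimal, hypothetical sketch of a batch extract-transform-load step in Python; the table names and cleansing rules are assumptions for illustration only, not taken from any specific warehouse:

    # A minimal sketch of the conventional batch pipeline described above:
    # extract a managed batch, cleanse/standardize in a staging step, load to the warehouse.
    # Table names and rules are hypothetical.
    import sqlite3

    source = sqlite3.connect(":memory:")        # stands in for an operational system
    source.execute("CREATE TABLE orders (order_id INT, customer_id INT, amount REAL)")
    source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                       [(1, 10, 99.5), (2, 11, None), (3, 10, 12.004)])

    warehouse = sqlite3.connect(":memory:")     # stands in for the target data warehouse
    warehouse.execute("CREATE TABLE fact_orders (order_id INT, customer_id INT, amount REAL)")

    # Extract: pull a managed batch from the source.
    batch = source.execute("SELECT order_id, customer_id, amount FROM orders").fetchall()

    # Transform: drop incomplete rows and standardize the amount in a staging step.
    staged = [(oid, cid, round(amt, 2)) for oid, cid, amt in batch if amt is not None]

    # Load: forward the prepared batch to the warehouse in one shot.
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", staged)
    warehouse.commit()
    print(warehouse.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # 2 rows loaded

The point of the sketch is simply that every step, extraction, cleansing, and loading, is defined and controlled in one place, which is what gives the traditional architecture its level of oversight.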
However, what seems to be happening in recent years is that enterprise data strategies have become more complex. It’s really driven by three continuously evolving realities. One is a lower barrier of entry for scalable, high-performance platforms, especially when using cloud resources that don’t have any capital acquisition costs and can be scaled up according to demand. A large number of organizations are looking at migrating their environments to the cloud because it’s lower cost and more economically feasible.
Number two is the availability of low-cost or, more frequently, no-cost open-source tools that simplify the analytical process. Years ago, when you had analysts using particular types of end-user reporting and BI tools, there were licensing constraints, there were limitations on availability of licensing to the desktop, and you might have needed particular types of expertise to be able to use those systems.
Today, we’ve got all sorts of available tools at low cost or no cost that data consumers are able to configure on their own. We’ve got an increasing degree of sophistication in the data communities. That leads into the third point, which is that there’s a broader array of personas that are positioned to derive value from information and analytics.
Aside from your traditional data analysts and your business analysts, you’ve got a number of different staff, folks, team members with a range of skills that go from being basically a neophyte when it comes to data analysis, reporting, and BI, to expert data scientists who are knee-deep and hands-on with the data. They’re all beneficiaries of the business intelligence, reporting, and analytics environments.
The result turns into this virtuous cycle where you’ve got a greater demand for analytics, and that means modernizing the data strategy. That means growing the enterprise data landscape. Growing the data landscape means there’s a greater number of data sources, a greater number of data pipelines, and a much more diverse distribution of the way that data is stored, managed, accessed, flowed, and piped across this more complex enterprise.
When you get this increase in data sources and data pipelines, it inspires downstream data consumers to be aware of more data sources, which makes them want to explore more ways that they can use the data to inform their decisions by creating new reports and new analyses and, in the case of the data scientists, applying the data in different ways. We get this virtuous cycle where the more data that’s available, the more demand there is for more data, which then continues to increase the complexity of that enterprise.
The challenge here is that it makes the environment more complex to support the analytics demands. This becomes an issue because, as the environment becomes more complex, you’ve got this growing number of sophisticated data consumers, each of whom now wants to exercise control over their own data pipelines.
Instead of having the way we used to have it in the traditional approach where you’d have an IT team that oversaw all the progression and pipelines of that extracted data from the sources, did the transformations and loaded the data into the target environment, now you’ve got these different data consumers and there’s a much more, you’d say, distributed control over those environments where each downstream user may be able to get access to data and be able to do their own data preparation, data engineering and data analysis.
You end up with this distribution of knowledge, but as a byproduct, you end up with diminishing data awareness. You’ve got a decrease in centralized authority and an increase in distributed authority, but then you also end up with a decreased ability to centrally manage all the available resources and to make everyone in the enterprise aware of what those are.
This is not an academic question. This actually becomes a complicated problem. We started out by noting that we’ve known for decades that business decisions are enabled through reporting, analytics, and business intelligence. That relies on data awareness, data availability, data accessibility, trustworthiness, freshness, currency, et cetera, but when your data awareness is diminished, it introduces questions about the data that’s being used for your BI or your reporting.
Essentially, that impairs the data consumers from making the best use of analytics. Instead of enabling those data consumers, the increase in complexity and distribution of data and the growing, complex internetwork of data pipelines actually ends up reducing the effectiveness of business intelligence and reporting because it throttles the ability to enable those end data consumers.
To enable informed decision-making and accelerate responsiveness, data analysts or data consumers have to be aware of what data sets are available, what data sets can be used and under what circumstances, how the data is sourced, and how the data is transformed between the origination points and the point when it’s been delivered into some kind of report or some kind of visualization, or forwarded in a particular format to an analytical application.
What are the dependencies that exist across the organization with respect to these data sources? Essentially, asking these key questions that I’ve got on bullet points on this slide here. Essentially, can I trust the data that’s made available to me? How are those data sources, or how are reports impacted by changes in the environment, or change to a data source? The third is can I get the data that I need at the right time?
When you think about it, you actually can reflect on this a little bit differently, in that we can put these questions a little bit in a different way. When you ask the question in this way, it’s really thinking about it as if the end-consumer is not part of the process. In reality, when we start looking at how the end-users are integrated directly into those increasing number of data pipelines, you have to then turn these questions a little bit on their side and say, “Wait a second. It’s not necessarily about can I trust the data in the warehouse, but rather, how can I examine the data pipelines that are feeding the data into my application to allow me to feel confident that I can trust the data that’s being made available to me?”
The second question, it’s, “Under different what-if scenarios, how can I determine how the different reports that I look at are going to be impacted if there was some change to the source data, or if there was some change to the data model of the source? Or even if there was some change to the circumstances under which the data in the source is collected, or some change to a policy that makes a modification to the way data is being collected and then subsequently propagated?”
The third question, “Am I getting the data that I need at the right time?” A different way of looking at that is, “What are the best methods for optimizing the data pipelines in the environment to ensure that I get the data that’s most current and most trustworthy in the right timeframe, but that it doesn’t diminish any other analyst’s ability to get access to the data that they need at the right time?”
“Are there ways that we can look at how the different data pipelines are configured to be able to facilitate enablement of data delivery in a way that is trustworthy, is not impacted in ways that we don’t understand, and can be optimized so that everybody in the organization gets what they need at the right time in the right way?” Essentially also, provided with the right degree of exposure so that there’s protections and those types of things.
From a logical perspective, what we’re looking at in terms of informed decision-making is that we require data awareness, and that data awareness is not just being able to look at a simple enumeration of data sets and a list of the data elements that are in those data sets, but rather having increased visibility into how the data elements that are being made available to me, in the formats in which they’re being made available, are actually produced. It’s no longer just a question of metadata.
In the past, if you came and asked these questions 10 years ago, the answer would’ve been, “Well, you need a metadata repository,” but in fact, a metadata repository is only part of the answer. When we look at Data Lineage, Data Lineage is really the answer here. Data Lineage methods are intended to help develop a usable map of how information flows across the enterprise. These methods help map out the landscape and provide a holistic description of each data object that exists within the organization and is used as a source at some point for any of the downstream consumers’ artifacts.
The data object sources, the pipelines through which data flows, the transformations that are applied to data elements or combinations of data elements along the way, the methods of access to those data objects and the data elements within them, the controls that are imposed; basically, any other fundamental aspect of information utility to the user. Data Lineage combines different aspects of corporate metadata. The first one would be the production lineage, which is the semantic aspect of tracing how data values are produced.
An example of this is– This is an interesting example, I think, because people don’t usually think about a report as being an object that is subject to metadata, but when you look at a report and you see that there is a field on a report, and there’s a column, and then there’s a value, these are not persistent typically. These are produced values, but they’re still data elements.
Those data elements are at the end of that production line. There’s an actual semantic interpretation of what that data element is, based on how it was produced. If you are able to get some visibility into the production chain for that particular data element, that gives you insight into what is actually being represented in that component of that report, or that bar chart in the visualization, or that result of some kind of analysis.
The second type of Data Lineage, or metadata that’s associated with Data Lineage, is the technical lineage, which covers the structural aspects of data elements as they are produced, consumed, extracted, and propagated across the enterprise. It’s not just the flow, it’s also what happens to each data element as it moves along the way.
The third piece of this is what we would call the procedural Data Lineage, which is a trace of the journey through the different systems and the data stores that gives you essentially an audit trail. If you looked at this audit trail of the changes along the way, it would give you the visibility to be able to answer the types of questions that we were asking on the prior slide, such as, “Can I trust the data in the data warehouse?”
If I can see what the meaning is, based on the composition of that data element as it progressed from its original source or sources, through its transformations, to the delivery point, then I can get a level of assurance that I can trust the data in the data warehouse. The second question being, “How is a report impacted by a change in the data source?” I’d like to be able to trace what the relationships and the dependencies are from the data source to the points in the different reports where that data is essentially employed or used.
Then, am I getting the data that I need at the right time? If I’ve got this visibility into the procedural lineage, I can look at whether there are any inadvertent delays introduced into the way that those data values are being produced. That will let me know whether there’s any potential for the introduction of a delay that could impact data freshness or data currency within that process. Lineage actually gives you visibility in multiple ways, and it needs to be addressed across multiple dimensions. We’re going to look at some perspectives on how Data Lineage has changed over time.
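To make the three lineage facets just described concrete (production, technical, and procedural), here is a small, hypothetical sketch of how lineage for a single report field might be recorded and traced; the element names, transformations, and job names are assumptions made for illustration only:

    # Each edge records the production facet (what transformation produced the value),
    # the technical facet (which structural element fed which), and the procedural facet
    # (which system/job performed the hop) for one report field. All names are hypothetical.
    lineage_edges = [
        {"source": "crm.orders.amount", "target": "staging.orders.amount_usd",
         "transformation": "currency conversion to USD", "system": "ETL job load_orders"},
        {"source": "staging.orders.amount_usd", "target": "dw.fact_orders.amount_usd",
         "transformation": "batch insert", "system": "ETL job publish_facts"},
        {"source": "dw.fact_orders.amount_usd", "target": "report.quarterly_revenue.total",
         "transformation": "SUM grouped by quarter", "system": "BI reporting tool"},
    ]

    def trace_upstream(element, edges):
        # Walk the audit trail backwards from a delivered field to its origins.
        for edge in edges:
            if edge["target"] == element:
                print(f'{edge["source"]} -> {element}  ({edge["transformation"]}, via {edge["system"]})')
                trace_upstream(edge["source"], edges)

    trace_upstream("report.quarterly_revenue.total", lineage_edges)

Walking that trail backwards from the report field is exactly the kind of answer to “can I trust this number?” that the procedural audit trail is meant to support.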
It’s interesting. I reflect back on some of the work that I did 20 years ago with respect to data quality, which looked at being able to trace the lineage of the process flow of the production of data that went into an end-user report and end-user analysis. The issue, though, was that 20 years ago, if you wanted to have Data Lineage, it had to be manual. You had to manually walk through your processes, document the metadata, document the dependencies, and manually manage all of that.
What emerged relatively recently, in what I would call the second-generation Data Lineage tools, was simple automation for doing things like creating a data inventory or harvesting metadata and then inferring some system-to-system dependencies, which really tells you at a very global level which systems touched which data elements in which data sets. You can tell that a system read a value or wrote a value, and even get some visual representation and a bridge or an interface to a data catalog.
It gets you part of the way there. This tells you a little bit about the data awareness or the data availability. It does give you some potential for inferring dependencies on the data elements, but it doesn’t really give you the depth that is necessary to be able to get the level of insight that’s necessary to be able to answer those questions the right way.
The emerging, I’ll call them bleeding-edge tools; I would group here what Octopai has briefed me on, Data Lineage XD. These are essentially the things that they’ve incorporated into their new release. I’m sure David Bitton is going to elaborate in greater detail, but these tools trace the lineage of data elements across different systems, the transformations from source to target, and the columnar dependencies between systems, basically cross-system lineage.
It shows how data flows from the origination point through the data pipelines to the different reports and analyses that are delivered to the data consumers. It provides column lineage that shows the transformations that are applied to data elements from the source to the delivery point. Then, inner-system lineage documents the details of the production of data elements within specific system contexts. I’m sure we’re going to get some more details on that when I hand it over to David.
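As an illustration of the difference between these granularities, the sketch below uses made-up system and column names (not Octopai’s internal representation) to show how detailed column-to-column edges can be rolled up into a cross-system, system-to-system view:

    # End-to-end column lineage recorded as column-to-column edges, then rolled up
    # to cross-system (system-to-system) lineage for the high-level view.
    # System and column names are illustrative only.
    column_edges = [
        ("SQLServer:Sales.Orders.Amount", "SSIS:LoadOrders.Amount"),
        ("SSIS:LoadOrders.Amount", "SQLServer:DW.FactOrders.Amount"),
        ("SQLServer:DW.FactOrders.Amount", "PowerBI:CustomerProduct.TotalAmount"),
    ]

    def cross_system_view(edges):
        # Keep only the system prefix of each endpoint and drop duplicate hops.
        return sorted({(src.split(":")[0], tgt.split(":")[0]) for src, tgt in edges})

    print(column_edges)                     # the detailed, column-level journey
    print(cross_system_view(column_edges))  # e.g. [('SQLServer', 'PowerBI'), ('SQLServer', 'SSIS'), ('SSIS', 'SQLServer')]

The same underlying edges serve both views: the column-level detail answers the fine-grained questions, while the roll-up gives the high-level map of which systems feed which.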
To cycle back, we talked a little bit before about the level of data architecture complexity and how there’s a need for automation. If we’re trying to do this manually, it is essentially undoable. Organizations are continuing to expand their data landscape across on-prem and cloud platforms. The complexity of these data strategies means that manual oversight and manual management of Data Lineage is going to be difficult, if not impossible.
Organizations are going to need tools that automatically infer, capture, and manage lineage, and provide a visual presentation that is intuitive to the data consumers and that provides details into this multidimensional Data Lineage. The implication here is that manual capture is difficult, it’s time-consuming, and worse, it’s error-prone. Automated capture and management of the lineage is going to provide trustworthy details about the data origins, the transformations, and those dependencies across the enterprise.
We’ll come back to the theme of today’s talk: Data Lineage accelerates BI responsiveness because it informs these processes and requirements, things like integrated auditing for regulatory compliance, or impact analysis to assess how code or model modifications impact data pipelines, or replication of data pipeline segments for optimization, or root-cause analysis. Access to these different dimensions allows data consumers to know what report data elements are available, how they are produced, what dependencies exist on the original sources, and what transformations were applied.
We can look at an example use case. This is timely and relevant. We’ve got these data privacy laws that are intended to protect against exposure of sensitive data. That’s typically engineered into a bunch of different applications. Here’s the issue. If you look at some of these laws, the laws change over time, where perhaps the definition of private data is expanded to include a data element that previously had not been included.
If you’ve got this constraint where you’ve got multiple systems that depend on analyzing data, and all of a sudden the law changes to say that some data element in the source is no longer available unless you have a particular right to view that data, how would I know where in the environment I need to make a change? What processes are impacted? What reports are impacted? What systems or what code needs to be reviewed and updated? Data Lineage provides the visibility for doing this impact analysis.
The cross-system lineage allows you to identify which systems are impacted by a modification to that externally defined policy. It’ll tell you, when you modify the use or the rules associated with a particular source data element, what systems are touching that data element and how they are touching it, so you can determine what needs to change. If you look at the column lineage, it’ll show you which dependencies need to be reviewed. The inner-system lineage will expose where there are internal data dependencies that might inadvertently expose the data that has now been incorporated into that definition of private data. This is just an example of a use case. There are multiple use cases for how Data Lineage can be used in different ways.
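A minimal sketch of that kind of impact analysis, assuming a hypothetical lineage graph in which a birth-date column has just been reclassified as private, might look like this; none of the object names come from the webinar:

    # Given a source column newly classified as private, walk the lineage graph
    # downstream to list every staging table, warehouse column, and report that
    # would need review. The graph and all names are hypothetical.
    downstream = {
        "crm.customers.birth_date": ["staging.customers.birth_date"],
        "staging.customers.birth_date": ["dw.dim_customer.birth_date"],
        "dw.dim_customer.birth_date": ["report.age_segmentation", "report.churn_model_input"],
    }

    def impacted_objects(source_column, graph):
        seen, stack = set(), [source_column]
        while stack:
            node = stack.pop()
            for child in graph.get(node, []):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    # Everything that must be reviewed if birth_date becomes regulated data:
    print(sorted(impacted_objects("crm.customers.birth_date", downstream)))

The traversal itself is trivial; the hard part, which is what automated lineage capture provides, is having a complete and current graph to traverse.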
What do we want to look for when we’re looking for a Data Lineage solution? Well, you have to look at this from the practical perspective. Again, Data Lineage is a tool. It’s used in different ways by different processes. I gave you one example, and there are a bunch of other examples. For the data analyst, Data Lineage can provide insight into the semantics and meanings of data elements that are available for developing a report or producing some kind of analysis.
For a data engineer, Data Lineage gives you details about the pipelines and cross-system dependencies. A BI developer might rely on lineage to track down issues that are affecting the development of an artifact. A data scientist might want to examine the different methods that other data scientists use to prepare data for their analyses, to see if there are any opportunities for replication that would speed up results. An application system developer might want to see how changes to policies or models need to be addressed across the enterprise.
You need to understand how to look for the right capabilities that will address all these different use cases for all these different types of consumers. I’ve boiled it down to these four categories. When you want to look for a Data Lineage tool, consider these four different facets. Breadth. You want to be able to have details about the breadth of how information flows across your enterprise.
When lineage is limited to a system-to-system data flow that doesn’t show the finer details about intercolumn dependencies or what transformations are being applied during each processing stage, that’s not going to be able to satisfy the needs of the different personas that we’ve just talked about. Look for tools that give you this description of the lineage across those different dimensions.
Going clockwise, automation. Again, I don’t think I can emphasize this enough: any attempt to do this manually is doomed to be error-prone. If you rely on manual capture and management, that’s going to be time-consuming and error-prone. Automation is going to remediate these issues. The third is visualization. You’ve got to have an intuitive method for providing the right level of detail and visualization to each type of persona, especially as the number of data pipelines increases and the complexity of those pipelines grows. If you recall, if we look at the continually evolving complexity of our data landscapes, you’ll see that relying on something that is not giving you the right level of detail, depending on who the consumer is, is going to impact their ability to make good decisions about how to address their particular use cases.
Finally, integration. Again, Data Lineage is a tool, but a tool among an arsenal of other tools, and Data Lineage tools need to integrate with these other tools and utilities, especially if you want to be able to automatically derive the lineage information. Look for products that are engineered to integrate with other complementary products. With that, if you’ve got questions, I think Shannon’s already told us how to share questions. If you think of one after the fact, you can contact me at either my Knowledge Integrity email or my University of Maryland email. I’m going to hand it back to Shannon to introduce David.
Shannon: Thank you, David Loshin. David, if you want to share your screen, you can start your side of the webinar. If you have questions for either David, you may submit them in the Q&A section, whose icon you can find in the bottom middle of your screen, and we will get to the Q&A at the end of the presentation. David, take it away.
David Bitton: Thank you, Shannon. Thank you, David, for an interesting presentation. I’m very excited to be here today and to be able to actually introduce you to Octopai’s Data Lineage XD, which is actually the first platform on the market to provide advanced multidimensional views of data lineage. What are multidimensional views of lineage? First of all, we have cross-system lineage, which provides you end-to-end lineage at the system level, from the entry point into the BI landscape all the way to the reporting and analytics.
This level provides high-level visibility into the data flow, and it also maps where data is coming from and where it’s going. Secondly, we have inner-system lineage, which details the column level within an ETL process, for example, or a report, a database object, and so on. Understanding the logic and data flow of each column provides visibility at the column level, no matter how complex the process, report, or object is.
Then, finally, there’s end-to-end column lineage, which details the column-to-column-level lineage between systems, from the entry point into the BI landscape all the way through to the reporting and analytics. Now what I’d like to do is jump into a demo and show you the power of Data Lineage XD with an actual use case. Bear with me, we should be able to jump into the Octopai demo environment.
What I’d like to do is, again, show you this in a use case. Imagine now that you have a support ticket that was issued by a business user. It could be Mr. or Mrs. CFO, and let’s say it’s the end of a quarter, and there’s something wrong with the report that they’re basing their quarterly results on, which is, of course, a common scenario that I’m sure many of you are familiar with.
In order to try to figure out what was wrong with that report, you’re going to need to understand how the data landed on it. In order to do that, it’s going to require reverse engineering. Of course, in most organizations today that are not using Octopai, that’s going to be done with a lot of manual work, which is going to be very time-consuming and, as David mentioned, inefficient, and it will also introduce some other production issues as well.
Now, this would not be the case with Octopai. Let’s go ahead and see how Octopai would address that challenge. Octopai will search through all of your various systems in order to gather the metadata that we need. What we see here is actually the Octopai dashboard. On the left-hand side, we see a sampling, in our demo environment, of some different ETLs from SSIS and also from SQL Server stored procedures.
In the middle, what we see here are, of course, the different database objects, tables, and views, and those are, of course, coming from SSIS, SQL Server, and as well, some textual files. To the right of that, we see the different reports and the reporting systems. Now, in order to investigate this error in this report, most BI teams will go through a very similar scenario, which is that they’ll probably start off by investigating the structure of the report and the reporting system.
After that, everything will need to be mapped, and then they’ll probably need to contact a DBA to ask questions such as which tables and views were involved in the creation of that report, if they don’t know themselves. They also might go in and take a look at the fields or labels to see if they were given the same names and, if not, which glossary was used. Now, investigating everything at this level first is, of course, the most common approach, and it makes sense because the error appears here; we’re going to look here first to see if the error crept in at this level.
Now, even after investigating everything here, our DBA may be kind enough to tell us there’s actually nothing wrong at this level; you may need to look at the ETL level. You’re going to need to take a step backwards and start investigating, of course, at that level, and of course, it’s going to be a very similar process. Now, in most organizations, in order to do that kind of investigation, if you’re lucky, it may take an hour or two if it’s a very simple scenario. If it’s more complicated than that, then it may take a day or two.
We even have scenarios where customers are telling us it sometimes takes weeks and even months, depending, of course, on the complexity. That’s a fair synopsis of how that would be handled in most organizations today. What I’d like to do now is actually show you how that would be addressed within Octopai, literally in seconds, and automatically. The report we’re having trouble with is called Customer Product.
I’m going to type that into our lineage module. As I type that in, Octopai will filter through the entire environment, showing me the report that we’re having trouble with. Now, what I’m going to start off with is actually showing you cross-system lineage. I’m simply going to click on that, and about a second later, we have a complete understanding at the cross-system level of how the data landed on that report. I’ve enlarged the legend on the left-hand side so you can actually see what we’re taking a look at.
On the right-hand side, over here, what we have is the actual report we’re having trouble with, the one our CFO complained about. As I move to my left, I can now see how that report was created. We can see here that there was at least one view involved in the creation of that report. As I continue to move to my left, you see here there’s another view and a few different tables that were also involved in the creation of that report. Now, if I click on any item on the screen, I get a radial dial that comes up, which gives me more options.
Now, just for argument’s sake, or just to give you another example of how we can help: if you needed to make a change to this table and wanted to know what the impact would be, imagine what you would have to do today in order to get that information. With Octopai, it’s simply clicking on the button there. We now see the dependent objects, of course at a high level, that would be impacted should we make changes to that one table. Of course, that would be the same for any object on the screen.
As we continue to move left, we find the ETL that was involved in the creation of that report. In this demo environment there’s one ETL, but many organizations that we’re dealing with are actually using multiple different systems to manage and move their data. If that’s the case in your organization, it’s not a challenge for Octopai; we can still show you the path that the data has taken in order to land on that report. Now, as we pushed our customer further in the scenario, we actually asked what went wrong with this report.
They admitted that a few weeks earlier, before they had started using Octopai, they had made changes to this one ETL over here, and most likely that’s why they’re facing production issues today. We asked them, of course: if they knew that whenever they make a change there will be an impact, why not be proactive, look into the impact that those changes would have on the environment or on the system, on the data pipeline, and of course make the corrections and avoid all of the production issues, the data quality issues that arise, the resulting loss of confidence in the data, and so on?
Of course, as David said, it’s basically undoable in most organizations, because there’s just too much to look into. There could be hundreds, if not thousands, or even tens of thousands of different objects, different ETLs, tables, views, reports, and so on, that could be affected by any one change to any one object in the environment, such as a table, a view, or an ETL.
Of course, organizations are forced to be reactive, because that’s the only way that they can work. They will try to make changes without creating any production issues, using the capabilities at hand, such as the knowledge of the people on the team, if they’re all still there and haven’t left the organization, and maybe some spreadsheets, hopefully kept up to date, and if not, they’ll deal with it.
Then, of course, using all of that, they’ll make those changes, maybe keeping their fingers crossed and saying a little bit of a prayer. Then 8 or 9 times out of 10, there will probably be no production issues, and then the 1 or 2 times out of 10 that there are production issues, because they’re forced to react, they will have to actually react to those issues. The problem with that is you’re only reacting to what you know of. What you don’t know of, of course, continues to snowball and create all kinds of havoc throughout the environment.
Now, of course, with Octopai, we can change that. We can turn that on its head. We can now empower you to become more efficient and proactive. Before you make a change, you can actually now understand what would be impacted should you make that change within the environment. Let’s say, like this customer, we were using Octopai and we needed to make a change to this ETL. With a simple click of the mouse on Lineage, we now understand exactly what would be impacted should we make changes to that one ETL.
What we see here is something quite interesting, because, if you remember, the reason why we started this entire root-cause or impact-analysis search is, of course, that we had one business user, albeit Mr. or Mrs. CFO, complain about one report.
As far as we knew, of course, ignorance is bliss, that was the only report affected. Now, however, that we take a look at the lineage of this ETL, we can be pretty much certain that this is most likely not going to be the end of the scenario. Most likely, some if not all of these different objects on the screen could have been affected, or would have been affected, by any change to this one ETL.
Of course, these different ETLs, stored procedures, views, sorry, there we go, views, and then, of course, these measure groups, dimensions, tables, and views, and of course reports, could have been affected. Most likely, what will then continue to happen in reality is that, as these reports get opened, you hope that the business users who are going to be opening those reports are going to actually notice the errors in them, because if they don’t, it’s going to be worse.
Now, they’re going to actually open support tickets. These reports will be opened throughout the year by different people within the organization with different job functions, of course. Since they’re opened throughout the year by different people at different times, and those support tickets are opened at different times of the year, there’s just no way humanly possible that those who are responsible for trying to fix those errors could know that there is one root cause.
What will happen is, as we said earlier, they’re going to start to reverse engineer those reports, which could take anywhere from hours to days or even longer. You probably know better than I how much time and effort is wasted throughout the year trying to reverse engineer those reports, because, of course, it’s not going to be limited to 6 or 8 or 10; it’s probably going to be hundreds. Now, I said wasted because if the BI team, or those responsible for correcting the errors, had known from the get-go that this ETL was the root cause, they wouldn’t have had to reverse engineer all of those reports to try to get to that root cause.
Now, I left these two here on the side to prove a point, and that is: if you’re working reactively and manually, most likely you will get to most of the errors in most of the reports in the system, but not all. Some of these reports will fall through the cracks. They will continue to be used by the organization. Then, of course, the organization will make business decisions based on those reports, which is going to be, of course, the most impactful of the two.
So far, what we’ve shown you up till now is a root-cause analysis. Then we jumped into an impact analysis, of course at the cross-system level. Now what I’m going to do is jump into the inner system to show you inner-system lineage. Let’s take a look at this SSIS package. Maybe you need to make a change to this ETL and you want to see the impact at the column level.
Let’s go ahead and click on the SSIS package and choose Package View. Here we see, of course, one package, as it’s a demo environment. Your production environment, if you’re using SSIS, will most likely have multiple packages, and of course you would see them here. Now let’s delve into the container. By double-clicking here, I’m going to delve into the container. Let’s take a look at the logic and transformations that take place within one of these processes.
I’m going to take DimProduct, double-click it, and now it will take me to a column-to-column level within the inner-system level. What we can see here is a source to target. Now, let’s say you’d like to see the entire journey from source to target, including the transformations and logic, of course, that happened within a specific column; you simply now choose the column that you’d like to get to. You might not be able to see it, but there are three dots that pop up to the right of that.
If I click on that, that will now take us to end-to-end, column-to-column lineage. If what you’d like to see is the entire journey, what you’re seeing here is from source to target, all at the column level. We can also show you how, from the column level, we can jump into the table level, schema level, and DB level. Of course, all of this is integrated. If you need to jump further into any one of these objects on the screen, for example, if I needed to go backwards now to cross-system lineage, I can click back and now go back into the cross-system lineage.
That was everything that I had to show you here today. Of course, there are other dimensions to Octopai’s platform that I haven’t shown you here today. We have, of course, data discovery. We have a business catalog, which is actually called an automated business glossary, or ABG. There may be, of course, other questions. If you’d like to see more about Octopai, you can get in touch with us, and of course, we’d be happy to arrange a more in-depth demo and presentation. Back to you, Shannon.
Shannon: Thank you so much for this great demo and information, and thanks to both of you for this great presentation. Again, if you have questions for either David, feel free to submit them in the Q&A section of your screen. To find the Q&A panel, just click that icon in the bottom middle, and we will answer the most commonly asked questions. Just to note, I will send a follow-up email by end of day Thursday Pacific Time, with links to the slides and links to the recording of this presentation.
If you see in the Q&A section that somebody already asked a question that you like, just hit that little thumbs-up button to escalate it. To dive in here, this came in: David Loshin, when you were talking, with the same information being used across the enterprise for various purposes, would you address any ethical implications that may be overlooked or not considered?
David Loshin: I’m not really 100% sure I understand what you mean by ethical considerations, although I do think that an example might be the determination that there is an unauthorized approach used to combine data from multiple origination points that results in exposing information that probably should not be exposed. That might be an example where we can automate the inferencing of characteristics associated with, say, a customer based on data that’s being pulled from multiple sources in a way that it shouldn’t be used. I would assume that would be a good example of a use case for lineage because you’re able to see how datasets are being blended and fused for downstream use. If you want to go back to the Q&A and type in a clarification, maybe we can cycle back on that question.
Shannon: Sure.
David Bitton: Shannon.
Shannon: Yes. Can you hear me?
David Bitton: I see a question here that I wanted to answer if you don’t mind.
Shannon: Okay.
David Bitton: I see a question here from one of the attendees. It says, “I see the demo uses Microsoft tools for end-to-end lineage. What other reporting or other tools do you support?” If you don’t mind, I’d just like to answer that question. Octopai actually has the most extensive list of supported systems, not just, of course, Microsoft. What you can see here is currently what we do support, plus what is in our roadmap.
You can simply find that on our website, on octopai.com under supported technologies. To give you an example, there’s ADF, Azure Data Factory, Netezza, Teradata, SQL, of course, Amazon Redshift is on the way, Vertica, Power BI, Qlik, MicroStrategy, Cognos, and of course, we have many more coming. Sorry about that. Sorry to interrupt. Shannon, go ahead.
Shannon: Sure. Lots of good questions coming in here about Octopai. In fact, speaking of that, does Octopai work within SAP to collect lineage information?
David Bitton: Within SAP we do not collect lineage. However, we do support SAP BO as a reporting system, and we can provide lineage to and from it.
Shannon: What is Octopai’s Enterprise Pricing?
David Bitton: That is a question that would be a little bit more difficult to answer in a forum like this, but I can give you an understanding of how it’s priced. It’s certainly not by user. Everybody within the organization can have access to Octopai and gain benefit from using it, and that includes also our Business Glossary. Everyone on the business side can also have access to it.
The way we do price Octopai is, like I said, not by user; it is by module, depending on the modules. Today, I showed you one module; there are other modules within Octopai, and it also depends on the metadata sources. Ballpark is anywhere from around $3,000 to $10,000 per month total. All in, there are basically no limitations on anything, and it includes training, upgrades, maintenance, and so on. That is, of course, on an annual license or an annual contract.
Shannon: What was the initial information created in Octopai?
David Bitton: I’m sorry, repeat the question again.
Shannon: What was the initial information created in your demo there?
David Bitton: I’m not sure I understand the question.
David Loshin: He was asking how did it bootstrap the collection of information? That’s how I’m inferring.
David Bitton: That’s how I understood it as well: how do we collect the metadata? It’s very simply done. There is an Octopai client that we send to the customer. The customer installs that once in their environment on any Windows system. They point that client to the various systems that we want to extract the metadata from. Of course, we provide you with all the instructions on how to do that. That entire process of configuring the Octopai client to extract the metadata should take no more than one hour. It’s done once.
Once that’s configured, you hit the Run button, and Octopai then goes ahead and extracts that metadata and saves it in XML format. Those XML files are saved locally. You can, of course, inspect them and take a look at them to ensure that they meet your security standards before you upload them to the cloud, where the Octopai service is then triggered. That’s where all the magic happens: the algorithms, the machine learning, and the processing power come into play to analyze that metadata and then make it available to you in the form of lineage, discovery, and even a business glossary for the business user via a web browser.
Hopefully, that answered the question.
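For readers who want a feel for what harvesting metadata into locally inspectable XML can look like, here is a simplified, hypothetical sketch; it is not Octopai’s client, and the catalog queries, element names, and file name are assumptions made purely for illustration:

    # A simplified, hypothetical illustration of harvesting metadata into XML:
    # read table and column metadata from a database catalog and serialize it
    # to a local file that can be inspected before any upload. Not Octopai's client.
    import sqlite3
    import xml.etree.ElementTree as ET

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE fact_orders (order_id INT, customer_id INT, amount REAL)")

    root = ET.Element("metadata", attrib={"source": "demo_warehouse"})
    for (table_name,) in db.execute("SELECT name FROM sqlite_master WHERE type='table'"):
        table_el = ET.SubElement(root, "table", attrib={"name": table_name})
        for _cid, col_name, col_type, *_rest in db.execute(f"PRAGMA table_info({table_name})"):
            ET.SubElement(table_el, "column", attrib={"name": col_name, "type": col_type})

    ET.ElementTree(root).write("demo_warehouse_metadata.xml")  # inspect locally before uploading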
Shannon: Yes, I believe so. There’s a follow-up to that: not only how is it initially set up, but how does it keep up to date?
David Bitton: Actually, a great point. Something that I forgot to mention is that the entire process I just described can then be automated so that, on a weekly basis, you upload new metadata to the cloud. It’s analyzed and given back to you so that, for example, you upload on Friday, and Monday morning you come back to work and you have a new version. That actually works quite well with most organizations, because development usually happens during the week and is then uploaded to production, so Monday morning you have a fresh new version.
Shannon: While we’re on that topic, how do you handle lineage with software as a service apps? Often we’re leveraging extracts or APIs to access data from those apps.
David Bitton: That’s a good question. Well, I know that we don’t support APIs, but I think I’m going to refer that response to Amichai, my colleague, who is on the line. Amichai?
Amichai: Sure. We have different methods of extracting metadata from all different types of sources. If in any case, the specific type of source is not supported for automation, there’s always an option to augment different types of lineage for anything you have that wouldn’t be supported.
Shannon: Perfect. Love it. How can a tool like this be operationalized to work in an enterprise system like an EMR?
David Bitton: Amichai, I think that one’s for you.
Amichai: Can you repeat that again?
Shannon: Sure. Yes. How can the Data Lineage XD be operationalized to work in an enterprise system like an EMR?
Amichai: Again, the tools that we support with automation are what David showed, and they’re available on our website. In addition to that, there’s always the option of augmenting additional lineage to get complete, full coverage if you have anything that’s not supported.
David Bitton: I think here someone may have misunderstood me. Akim Perdun had asked the question, no API integration? Yes, we do have APIs that can be called upon. If you need to export everything or anything within Octopai, you can call those APIs and they can inject that metadata and lineage into a third-party application. We also have direct integration with some other industry systems as well.
Shannon: Awesome. It is clear that there is a need for a comprehensive data lineage tool to learn and understand the semantics, structure, and process. How do you integrate the tool to the ecosystem and other products and utilities?
David Bitton: Okay. Amichai, once again, I’m going to refer that one to you.
Shannon: Oh, they’re muted.
Amichai: Yes.
David Bitton: Sorry about that.
Amichai: The way our tool integrates is basically the way David described before: we actually connect to the different tools in the BI ecosystem. We pull metadata from those tools in an automated way. Once we do that, we’re completely away from the ecosystem. We perform all the analysis that David talked about on the side and make it available via a URL.
Shannon: I love all these questions about the product. Lots of interest here. How does Octopai help in a distributed environment where data sets are extracted and used locally?
Amichai: Again, it really depends on the type of implementation. You would make use of all the different methods that we’ve been discussing so far. It really just depends on the different type of environment that you have.
Shannon: There’s lot of questions here about data catalogs. Do you connect to any other data catalogs, Collibra, IBM, any others?
David Bitton: We have direct integration with Collibra. I see here a question, and as I mentioned earlier, we have APIs that can be called upon to integrate with others. We are also in talks with others to integrate directly with them. There was a question about the difference between the data lineage that Collibra delivers and what Octopai does. The answer comes down to the use case. If data governance use cases are the use cases that you’re concerned with, then, of course, Collibra would be well suited for that. If they are the use cases that would be involved in a BI landscape, such as impact analysis, reverse engineering reports, and the various other scenarios within BI, then, of course, Octopai would be more suited for that.
Shannon: Augmented data management is a concept catching up with clients. Do you see Octopai catering to that market as well?
David Bitton: I’m not familiar with that. I think Amichai might know a little bit more about that, but maybe you want to attempt that, Amichai?
Amichai: Again, I missed the first word. What was that?
David Bitton: Augmented.
Shannon: Augmented. Yes.
Amichai: Augmented data management?
David Bitton: Yes.
Shannon: Yes. Augmented data management is a concept catching up with clients. Do you see Octopai catering to that market as well?
Amichai: Yes. Well, they complement each other in a way, and Octopai, as I mentioned before, allows you to augment lineage while you really enjoy the big benefit of the automation whenever it’s needed. That’s also part of what we offer.
Shannon: There are a lot of questions here about what other products you connect with and how you connect. Is there a link that we can get, that I can send in the follow-up email, that shows all the technologies?
David Bitton: Yes. If you can see here on my screen, that is a link to the currently supported systems. If you’d like, I can send that to you afterwards, it’s octopai.com/supportedtechnologies/.
Shannon: Love it. Would you mind putting that in the chat for us, please? Then I’ll copy it over and include it in the follow-up email. I think we’ve got time for a couple more questions here, at least one more. Is it possible to include metadata from other governance tools into lineage? For example, if a ClusterSeven inventory of reports is described with lists of sources.
David Bitton: Sorry, run that question by me again.
Shannon: Sure. The core of it is, is it possible to include metadata from other governance tools into the lineage?
David Bitton: That is a good question. It all depends. Currently today Octopai supports the systems that I mentioned earlier. We do have augmented links which we can actually add for systems that are not supported on the list there. Amichai, did you have any other options possibly for that question there?
Amichai: Yes. It, of course, depends on what type of metadata you have in those different tools, but yes, there is an option to import different data assets from different tools into Octopai.
Shannon: All right. I think we do have time for one more. How might this be used to track data flows from discrete Internet of Things devices across multiple source channels?
Amichai: All right. The way Octopai works is we basically connect to the different data pipelines and different data elements, the data warehouse, the reporting tools, and all of those, and we will pick up the metadata from those places directly. That’s the way we harvest the metadata and build the entire lineage.
Shannon: I love it. Well, that does bring us to the top of the hour. I’m afraid that is all the time we have. So many great questions about the product and so much interest. Again, just a reminder: I will send a follow-up email to all registrants by end of day Thursday, Pacific Time, with links to the slides and links to the recording, as well as the additional information requested here. Thank you to everybody for these great presentations and information. Thanks to Octopai for sponsoring today’s webinar and helping make all this happen, and thanks to everybody who’s been so engaged. We really appreciate it and hope you all have a great day. Thanks, everybody.
Amichai: Thank you, everyone.
David Bitton: Thanks.
David Loshin: Good day. Bye-bye.
Shannon: Thanks, guys.