Video Transcript
Jim Powell: Hello, everyone. Welcome to the TDWI webinar program. I’m Jim Powell, editorial director at TDWI, and I’ll be your moderator. For today’s program, we’re going to talk about Building Trust in Data with Advanced Data Lineage. Our sponsor is Octopai.
For our presentations today we’ll first hear from James Kobielus with TDWI. After Jim speaks, we have a presentation from David Bitton with Octopai. Before I turn over the time to our speakers, I’d like to go over a few basics.
Today’s webinar will be about an hour long. At the end of their presentations, our speakers will host a question and answer period. If at any time during these presentations you’d like to submit a question, just use the Ask a Question area on your screen to type it in.
If you have any technical difficulties during the webinar, click on the Help area located below the slide window, and you’ll receive technical assistance.
If you’d like to discuss this webinar on Twitter with fellow attendees, just include the #TDWI in your tweets.
Finally, if you’d like a copy of today’s presentation, use the Click Here for A PDF line there on the left middle of your console.
In addition, we are recording today’s event, and we’ll be emailing you a link to the archive version so you can view the presentation again later if you choose, or share it with a colleague.
Okay, again, today we’re going to be talking about Building Trust in Data with Advanced Data Lineage. Our first speaker today is James Kobielus, senior director of research for data management at TDWI.
Jim is a veteran thought leader, industry analyst, consultant, author, and speaker on analytics and data management. Over the past three decades, Jim has held analyst positions at Futurum Research, Wikibon, Forrester Research, Current Analysis, and the Burton Group. He also served as senior program director, product marketing for big data analytics at IBM, where he was both a subject matter expert and a strategist on thought leadership and content marketing programs targeted at the data science community.
At TDWI, Jim focuses on data management, which encompasses database platforms, data governance, data integration, master data management, data ops pipelines, and much more. Jim, I’ll turn things over to you now.
James Kobielus: Thank you, Jim. Hello, everybody. It’s a pleasure to be speaking with you today. Before I get into the core of my presentation, we’d like to do a poll. We’d like to ask the audience to interact with us here using the web interface and tell us what are your biggest data challenges. The options are, implementing new business processes, fixing processes that are not working or making process improvements, mergers and acquisitions, migrations and upgrades, impact analysis, fixing reporting errors or lack of trusted reporting, regulations and compliance issues, and reducing operational costs.
You can choose two or more of these options and click the submit button on the lower right and let’s see your responses. We’ll wait a few seconds here for the audience to, here we go, begin to click in with their responses. Wait a few more seconds here to see the updates.
Oh, it’s clear from the initial responses that fixing processes that are not working or making process improvements is a big challenge for data management professionals. In fact, lineage and impact analysis are core capabilities that many practitioners tell us at TDWI they need to fix; they need high-quality lineage and impact analysis in order to manage a variety of functions, from ETL through BI and so forth.
Reducing operational costs, fixing reporting errors, and the other responses you can see on-screen all track very closely with the research we’ve done into the challenges facing data management and analytics professionals.
Okay, now we’ll move on to the core of the presentation. After my presentation, I’ll hand it back to Jim so that Octopai’s David Bitton can present their demonstration and remarks. What we’ll talk about today is as follows: how to illuminate data assets’ full lineage and downstream impacts, how to build trust in data through automation of lineage and impact analysis, and how to use advanced data lineage tools to augment the productivity of your data management and analytics professionals.
Well, for starters, the core thing you need to know is that trusting any piece of data requires full visibility into that data’s entire life story: deep and comprehensive visibility of the sort that data lineage and impact analysis tools provide. Data doesn’t live in isolation from other pieces of data, or from your business processes and the challenges you face; in fact, the data itself is a key input into them all. Being able to trust that data requires visibility into the many dimensions of the data you’re using inside of your analytics-driven applications.
Visibility into the data’s provenance. How did that data originate, in what applications, and for what purposes? How has that data been handled, moved, transformed, cleansed, augmented, and so forth, from the point at which it was created up to the present moment? Provenance is a key dimension of trust in data.
The data’s correctness. Has the data been kept accurate and updated at every point in time for every use? To the extent that the data has been allowed to lapse into error at any point in its life cycle, decisions may have been taken based on those errors, and those can have adverse impacts on the business.
Completeness. Has the data been aggregated, packaged, and delivered with all the other relevant content for every decision, at every point in time, for every use? Clearly, if you’re working from partial, fragmented data at any point in time and basing your business on it, then even though individual data elements might be accurate and up to date, the sum total of the data presented to a decision-maker might be so fragmented that they can easily go down the wrong path in the actions they take based on it. Completeness is very important for trusting data.
Consistency. How consistent has that data been? Have all copies of that same data been kept in sync at every point in time and for every use?
Compliance. Has that data’s format, content, handling, and management been kept in compliance and conformance with all relevant standards, mandates, and requirements?
Clarity. Is the data’s meaning clear for every use and every user? Is the data’s meaning understandable to users in terms of the vocabulary or glossary accepted within particular business functions and uses?
The data’s impact. How readily can the consequences of modifications to the data be discerned in downstream uses? In other words, data has impacts on the business. In order to trust data, you need to understand fully the extent to which changes to its content, format, and handling impact business uses.
Trust requires visibility into a broad range of dimensions.
In TDWI’s research, we do primary research with data and analytics professionals, and these points have been borne out amply over time in virtually every study we do. What we find is that data and analytics professionals say that having trust in the data’s quality, completeness, consistency, and so forth is absolutely essential, and that improving visibility into its provenance, correctness, and so forth is central to data’s value in business.
In order to maintain that level of trust, more of these functions, the entire data pipeline from source to consumption, need to be automated and accelerated. Really, data lineage and impact analysis form a critical set of capabilities for ensuring trust, and they need to be brought more completely into the management and stewardship of that data over its lifecycle.
Advanced data lineage encompasses a broad range of capabilities and tooling. It builds on metadata management, data about data. You also need a semantic layer and a business glossary within your data lineage practice, in order to ensure that the data is presented to end users in terms that are meaningful to the actual uses of the data. You need a data catalog for high-quality lineage and impact analysis, in order to have complete visibility into your entire data estate across your distributed organization, and a unified data model with interactive visual mapping and embedded AI to be able to find the patterns and anomalies in the data in your organization.
A self-service user experience is very important for the actual users of data lineage tooling, so that the visualizations, charts, and analytics are presented to them with the ability to interact, and so that they can visualize the entire narrative of any data element, its entire history from source to consumption, fairly completely. Automation of the data lineage process is very important, given the vast amounts of data from myriad sources that are flooding in real time into all manner of downstream applications.
In order to stay on top of all that and all the metadata associated with it, you need tooling in your infrastructure, which is increasingly a cloud infrastructure, that can crawl, index, tag, and map that data by lineage and impact across your entire ETL pipeline, across different data sets, and across a wide range of sources with little or no human intervention. The data is just too huge, fast-moving, and fast-changing for manual efforts alone to provide the visibility needed to manage it effectively.
Data lineage and impact analysis also need to be accelerated. To do so you need tooling that enables high-performance search, visualization, and analysis of data in all of its dimensions, everywhere in your organization.
Augmentation is a key capability of advanced data lineage, meaning the tooling you implement for data lineage and impact analysis needs to improve the productivity of your BI, ETL, and other professionals who manage the data, by providing them with contextual visualizations of the data’s provenance, processing, handling, and movement everywhere, as well as a unified roll-up of the data within the context of a unified data model. The goal is to profile that data and its impacts across your many applications and downstream uses, to help the human beings who manage the data and take responsibility for it do their jobs more effectively.
Advanced data lineage addresses diverse scenarios in business. First of all, cross-system lineage. You need tooling that enables you to visualize the impact of process changes in your organization on the affected data elements that surface in your analytics and BI applications.
Column-to-column lineage capabilities enable you to visualize how data is propagated across the columns and linked tables of systems that span your entire DataOps pipeline.
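To make the idea concrete, here is a minimal sketch, not any particular vendor's implementation, of how column-to-column lineage edges might be represented and traced. All system, table, and column names are hypothetical:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class ColumnRef:
    """A column identified by system, table/object, and column name."""
    system: str
    table: str
    column: str

# Each edge says: the target column is derived from the source column,
# via the named process (ETL step, view, report query, etc.).
LINEAGE_EDGES = [
    (ColumnRef("CRM", "customers", "cust_id"),
     ColumnRef("DWH", "dim_customer", "customer_key"), "ssis_load_dim_customer"),
    (ColumnRef("DWH", "dim_customer", "customer_key"),
     ColumnRef("BI", "customer_product_report", "customer"), "report_query"),
]

def downstream(start: ColumnRef):
    """Walk the lineage graph forward and yield every column impacted by `start`."""
    by_source = defaultdict(list)
    for src, tgt, _ in LINEAGE_EDGES:
        by_source[src].append(tgt)
    stack, seen = [start], set()
    while stack:
        col = stack.pop()
        for tgt in by_source[col]:
            if tgt not in seen:
                seen.add(tgt)
                yield tgt
                stack.append(tgt)

for col in downstream(ColumnRef("CRM", "customers", "cust_id")):
    print(col)
```

The point of the sketch is simply that column-level lineage is a graph problem: once the edges exist, impact tracing is a traversal; the hard part in practice is extracting those edges from ETL code, SQL, and report definitions across systems.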
Another type of data lineage that’s very important is inner-system data lineage, which enables you to visualize the flow of data elements in a report within your ETL pipeline, or really within and across data objects running on a specific system.
Advanced data lineage enables you to deliver value to many stakeholders in your business. For data and analytics professionals: if you’re a BI developer, for example, advanced data lineage enables you to identify the root cause of why a particular dashboard, perhaps a financial dashboard, is displaying an anomalous value at any point in time. You can go back and trace where the data originated, what calculations were performed on it, how it was meshed and merged with other data elements, and so forth.
That tells you why your CFO is seeing a particular value at any point in time. That’s critically important for them to know, both to report up through the standard financial reporting process and for compliance purposes. Lineage also enables you to determine why specific values violate compliance mandates.
It also shows why specific fields don’t match across diverse reports. In a typical organization, you might have hundreds of reports in your analytics environment covering the same or overlapping application domains. It’s quite possible that what appear to be, say, customer fields for the same customer across different systems might not match at any point in time. Strong data lineage tooling enables you to determine exactly why they don’t match and then to fix the problem.
If you’re a data engineer, data lineage tooling enables you to determine the flow of the ETL processes that have populated specific data downstream into diverse apps. That data may be required to produce specific analytics for specific purposes, so you can build the ETL logic that feeds your analytics with the lifeblood data they need and make sure that data is high quality, consistent, and so forth.
If you’re a data architect, advanced data lineage enables you to map cross-system dependencies across your different transactional and analytic systems. You can determine the impact of changes to connected business processes and how those changes affect data updates across scattered databases. That’s another key function of advanced data lineage, helping particular stakeholders be successful in their jobs.
With that, I’m going to hand it back to Jim, and then he’s going to hand it over to Octopai to show what they’re doing in this area.
Jim: Thanks, Jim. Just a quick reminder to our audience, if you have a question, you can enter it at any time in the Ask A Question window. We’ll be answering audience questions in the final portion of our program.
Our next speaker is David Bitton, VP of global sales at Octopai. David has extensive product knowledge coupled with creative ideas for product applications, and a solid history of global sales and marketing management of software-as-a-service and internet-driven products. Hello, David.
David Bitton: Hello, and thank you for this opportunity to speak. I’m certainly excited to be able to present. Thank you, everyone, and thank you, Jim, for those thoughtful insights. What I’d like to do now is jump into a short presentation and then a demo, and talk a little bit about Octopai and how we can actually address these challenges in your day-to-day activities.
Before actually, we do that we’re going to have a short little poll. This is basically on how much time you’re spending in manual processes to find and/or understand your metadata or lineage for that matter. Whether it’s 1% to 19%, 20% to 49%, 50% to 75%, or above. Obviously, you can actually choose one there. I’m going to submit that for you guys to respond to that. I hit submit, I think you should’ve received it. [silence]
Okay, I think we’ll give it another minute or two. The answers that are coming in show us about half the day or more is being spent in manual processes. That is more in line with exactly what our customers are telling us. They’re actually spending 50%, 70%, 80%, even 90% of their days in manual activities, where, of course, those things can be automated.
I’m going to jump into the presentation part now. What I’d like to do is first talk to you a little bit about Octopai. The reason I’d like to do that is to show you where our company comes from, our founders. They have actually been there and done that, exactly what you are doing on a day-to-day basis.
The company was started a number of years ago by our founders, who came from the BI landscape. Like I said, they were living and breathing the same challenges that most BI professionals are facing, and they were doing so in different BI groups in different industries such as insurance, telecom, healthcare, and others. They faced basically the same challenges that most of you are facing.
The challenges stemmed from complaints about data reliability, similar to what James was talking about. If you don’t know where the data’s coming from, how can you ensure that it is reliable? There were other issues with the accuracy of reports, and of course all of these complaints were coming from the business users.
Other issues related to changes. I think that was the main answer in the previous poll: whenever they would make a change to a specific process such as an ETL, it would unexpectedly impact a report, for example, or a field or a table down the line. Or they were just trying to find and understand where all of their data was. Unfortunately, all of that work, like you just answered in your poll, was being done manually, which is very time-consuming and challenging. Those are the challenges that led them to create Octopai. Now, although it sounds ambitious, that is exactly what we’ve been able to do.
All right, so how do we do it? With Octopai, all of that metadata, which is so crucial for you to understand and so difficult to collect, as was stated in the poll here, is collected by us and placed into a cross-platform SaaS solution. It’s done automatically. I want to stress the fact that it’s done automatically, meaning there are no manual processes involved such as documentation, there’s no prep work or preparation needed, no customizations. We’re certainly not sending a bunch of highly-paid professional services people to do that for you.
It’s none of that; it’s all done automatically by our platform. The metadata, once it is collected, is centralized in that platform, and then it goes through a slew of processes: it’s analyzed, modeled, parsed, cataloged, indexed, and so on. Then it’s ready for discovery, so that you can easily find the metadata or that crucial lineage you need in seconds with the click of a mouse.
Not only does Octopai provide those instant results, we also provide you with the best, most accurate picture of your metadata at any given point in time. That’s because it’s refreshed on a weekly basis. That actually answers, I think, a question that was submitted earlier.
All right. What I’d like to do now is talk a little bit about Octopai’s Data Lineage XD. Continuing our pioneering leadership role, Octopai introduced Data Lineage XD, the first solution on the market to provide advanced multi-dimensional views of data lineage. What does that mean?
First of all, the data lineage you may be receiving in some of the tools that are out there today is most likely going to be single, or maybe even dual dimensional, but certainly not XD or multi-dimensional like you’ll see here.
Number one is cross-system lineage. Cross-system lineage provides you end-to-end lineage at the system level, from the entry point into the BI landscape all the way to the reporting and analytics. This level provides high-level visibility into the data flow, mapping where data is coming from and where it’s going.
Number two is inner-system lineage, which details the column-level lineage within, for example, an ETL process, report, or database object. Understanding the logic and data flow for each column provides visibility at the column level no matter how complex the process, report, or object is.
Then finally, there’s the end-to-end column lineage, which details the column-to-column level lineage between systems from the entry point into the BI landscape all the way through to the reporting and analytics.
What I’d like to do now is jump into the demo, I’ll show you in a live demo how we can actually do that for you.
All right, so just give me one second; I should be able to share my screen, and in a second or two you should be able to see the Octopai platform. What I’d like to do now is demonstrate the power of Octopai’s Data Lineage XD with a use case or two. Octopai is a platform that has three main capabilities, or modules, which is actually the more appropriate term. These tools are there to help BI professionals in their day-to-day activities and in addressing the challenges they face.
Number one is automated data lineage. Then there is automated discovery of specific objects within your environment, and then we have, of course, the automated data catalog. All three are part of the platform; you can see them here on the left-hand side.
What we see here on our screen is Octopai’s dashboard. On the left-hand side we see an example of the various systems in our demo environment, containing the different ETL systems and the ETLs themselves, 397 of them. Then in the middle we have the various databases, data warehouses, analysis tools, and even textual files, comprising 3,200 or more objects. Then to the right of that, we see the different reporting systems and the different reports.
What I’d like to jump into now is the first use case, and it’s similar to what Jim was talking about, where you have a CFO complaining about a certain data element they’re not sure about. They need to sign on the dotted line. Let’s say it’s the end of the quarter; they need to be absolutely sure. If they’re not, they’re going to be getting in touch with you, or with whoever in your organization is responsible, to find out and make sure the numbers are right.
Imagine a support ticket has been issued by that user, the CFO, let’s say, and the responsible team needs to figure out what went wrong, what the issue is with that data element or column, whatever it may be. In order to do that, most likely you need to reverse engineer the report in order to see how the data landed in it.
In order to do that, as we’ve been saying many, many times, it’s probably going to involve a lot of manual work, which is going to be very time-consuming. It’s just not efficient anymore to be working that way, and the result may not even be 100% accurate because of all of that manual work. Of course, that wouldn’t be the case with Octopai, so let’s go ahead now and see how Octopai would address those challenges.
As I mentioned earlier, what we have here on the screen, this is exactly what Octopai’s platform has gone ahead and extracted from the demo environment. It’s gone through the different systems and gathered the metadata here.
Now, before we jump into this: in general, in order to address this challenge, most organizations have a business intelligence team that’s responsible for it. In order to do that, they probably need to do some investigative work, which probably starts with investigating the structure of the report and the reporting system. Everything will then need to be mapped. Then they probably need to contact different people on different teams, such as, let’s say, a DBA, to ask questions about the tables and views that were related to the creation of that report.
They would also need to understand whether the fields and labels were given meaningful names and, if not, which glossary, if any, was used. Even after investigating all of this at this level, which may take hours if you’re lucky and it’s a simple scenario, or days if it’s a little more difficult, or even longer, the DBA may tell them there’s nothing wrong at this level; maybe the ETL is the cause of the problem, so they need to investigate that in a similar process.
All in all, like I said, if you’re very lucky this scenario may take an hour or two; more likely it’s a day or two, or even a week or two, or even longer if you’re unlucky.
Now, that’s a fair synopsis of the way that will be handled in most organizations. What I’d like to do now is show you how that would be done within Octopai literally in a matter of seconds and automatically.
Mr. or Mrs. CFO just got off the phone and is telling you that there’s a problem with a report called Customer Product. You jump into Octopai, go to the data lineage module, and type in the name of the report they’re complaining about. Octopai filters through your entire landscape and shows you the troublesome report. What we see here are different options: cross-system, inner-system, and so on. We’re going to jump into the cross-system lineage now so we can understand at a high level where the data is coming from.
When I click on cross system, you can see here that literally in about one second I now understand how that report was created. Once again, at the cross-system level, we will take a look at inner system and then end-to-end, column-to-column as well.
What we see here is the report on the right-hand side. As I move to my left, I can start to see how that report was created. What we see is that there were two different views involved in the creation of that report. Clicking on any object on the screen will give us more information, more capabilities, deeper insight. If I click on this view, I can get a visualization.
Let’s say there are lots of transformations; getting a visualization will help me understand those transformations. Clicking on that will show us a visualization from the source, through the transformations, to the target.
As I continue to move to my left, I see that we have another view, as I mentioned earlier, that was involved in the creation of that report. We also have three different tables. Again, if I click on any one of these tables, a radio dial comes up, and I now have more options and more information.
Let’s say I needed to make a change to that table, or to a column within that table, and I wanted to know what dependent objects might be impacted. I simply click on the six with an arrow to the right, and I see that there are six objects to the right of it that would be affected: these stored procedures, tables, and of course this specific measure group.
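For comparison, if you were doing that dependency check by hand on a single SQL Server database, without a lineage tool, you might query the engine's own dependency catalog. A minimal sketch, assuming a pyodbc connection; the server, database, and table names are hypothetical, and this only surfaces dependencies the database itself tracks, not ETL jobs, cubes, or reports:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dwh-server;"
    "DATABASE=SalesDW;Trusted_Connection=yes"  # hypothetical server and database
)

# sys.sql_expression_dependencies lists objects (views, procedures, functions)
# that reference the given table -- one database's slice of an impact analysis.
IMPACT_QUERY = """
SELECT DISTINCT
    OBJECT_SCHEMA_NAME(d.referencing_id) AS schema_name,
    OBJECT_NAME(d.referencing_id)        AS dependent_object,
    o.type_desc                          AS object_type
FROM sys.sql_expression_dependencies AS d
JOIN sys.objects AS o ON o.object_id = d.referencing_id
WHERE d.referenced_id = OBJECT_ID(?);
"""

for row in conn.execute(IMPACT_QUERY, ("dbo.DimCustomer",)):  # hypothetical table
    print(f"{row.schema_name}.{row.dependent_object} ({row.object_type})")
```

Even this manual route only covers one database at a time, which is the gap the speakers are describing.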
Now, the question to you is how long that would normally have taken if you were doing it manually. As we continue to move to the left, we see that it was not just these two views and these three tables that were involved in the creation of that report, but actually four different ETLs, not just the one our DBA suggested might be the culprit for the errors in that report. We’re also seeing that they’re from different systems; what we see here is ADF, Informatica, SSIS, and so on.
The reason I’m pointing that out is to show you that the fact that you may be using multiple systems to manage and move your data is not a challenge for Octopai. In many organizations that we deal with, there are many different systems: different ETLs, different data warehouses, databases, even data lakes, and there could be many different reporting systems. The fact that you’re working that way is not a challenge for Octopai. As you can see here, we can still map the path the data has taken regardless of how many systems it’s gone through in order to land in that report.
What we’re showing you here is the reverse engineering of a report to understand the root cause, to understand what happened with that report. We’ve jumped into inner-system lineage as well, just to take a peek at that. What I’d like to do now is continue on with our scenario.
When we pressed our customer and asked them what they thought went wrong with that report, they admitted that they had actually made some changes to this one ETL a couple of weeks earlier, prior to using Octopai, of course. And as oftentimes happens, whenever they would make any change within their environment, some production issues would follow.
Now, of course, we asked them: since they knew there would be production issues after making changes to that specific ETL, why not be proactive? Look into the impact those changes would make, do an impact analysis, and then make the corrections; of course you would avoid all of the data quality issues and production issues, and confidence in the data would certainly go up if you removed all of that.
Of course, as we all know, that’s a lot easier said than done. In most organizations, being proactive and trying to do an impact analysis before you make any changes is, if not impossible, pretty close to it, because there’s just too much to look into. There could be literally hundreds, if not thousands, even hundreds of thousands, of different objects that could be affected by any single change.
Most organizations are therefore forced to work reactively. When they make these changes, they will try to avoid production issues with whatever knowledge is at hand. Maybe they have people on the team who created these different ETLs and tables and so on. If not, maybe those people left the organization, and the knowledge left with them.
They could have some spreadsheets, some of which may be outdated, and so on. Using all of that put together, maybe even with a finger to the wind and a prayer, they’ll make those changes, and 8 or 9 times out of 10 they’ll probably be right on and there won’t be any issues. Then the 1 or 2 times out of 10 that there are, they’ll react to them when they become apparent. The problem with that is that you’re only reacting to what becomes apparent, meaning you’re only fixing what you know about.
What happens to all of those errors in those tables and views and reports, or changes and so on, that are happening that you don’t know about? What’s happening is that this creates a lot of data quality issues, and confidence in the data is low as a result.
With Octopai, we actually turn that right on its head. We now enable you, empower you, to become proactive, and it’s very simply done. If you need to make a change to this one ETL over here, we can now give you the full insight and clarity to be able to make those changes without hitting or hurting production. If I needed to do that, once again, I click on the ETL; let’s take a look and see how we would do that.
I’m going to jump now into the cross-system lineage of that ETL to understand what will be happening, the lineage of that one ETL. What we see here is something quite interesting. If you remember, we started off because we had Mr. or Mrs. CFO complaining about that one report over there. Now that we’ve taken a look at the lineage of this ETL, we can say with near certainty that that’s not going to be the end of the story.
Most likely, after making changes to this one ETL, you may have impacted some, if not all, of the different objects on the screen, including these different ETLs, views, stored procedures, tables, tabular models, and so on, and certainly, of course, these reports on the right-hand side. Most likely what will happen in a normal, real production environment is that when these reports are opened, your business users are going to start to notice errors in them.
I hope that they start to notice those errors because if they don’t it will certainly be much worse. If they notice the errors, it actually gives you an opportunity to try and fix them.
They’re going to start to open support tickets. Those who are responsible for looking into the errors in those reports are now going to need to start to reverse engineer those reports, because we established that’s the way to get to the bottom of the issue. Now, it’s not humanly possible to expect that they would know from the get-go that the root cause is this one ETL, so they will need to invest hours, days, or weeks trying to figure that out.
These errors in these reports come in throughout the year, from different people at different times, so there’s no way they could know that there is one underlying cause; they will need to reverse engineer each of those reports. Of course, we said earlier that it could take hours, days, weeks, and so on, per report. You know better than I do how many reports you’re addressing on a daily or yearly basis, and I can be fairly sure it’s not six or seven or even eight. It’s probably going to be hundreds, if not thousands, of errors throughout the year.
You can do the math and try and figure out how much time that your teams are really wasting in manual processes trying to understand what went wrong when they could have all of that information at their fingertips with Octopai’s automation literally in a second.
Now I left these two reports here on the side and that is to prove a point. That is, if you’re working reactively, like we said earlier, if you’re lucky you’ll get to 8 or even 9 out of the 10 errors and you correct those before they create havoc, but you won’t get to all of them. Probably, most certainly, you won’t get to all of them.
Some of those reports will fall through the cracks, those reports will continue to be used by the organization, and the organization will make business decisions based on those reports. Of course, that’s going to be the most impactful out of all of those.
All right, so now just to continue on. We’ve shown you a couple of different use cases so far, reverse engineering or root cause analysis, and now we’re taking a look at an impact analysis. These, of course, we’ve shown you so far at the cross-system level and the inner-system level. What I’d like to do now is go deeper.
Let’s take a look at this SSIS package over here. Maybe you need to make a change to this ETL, and you want to see what will be impacted at the column level. Let’s click on the ETL package and click view; let’s take a look at the package view. What we see here is one package; of course, this is a demo environment. If you’re using SSIS, you most likely have multiple packages here. You would see them here, and you would also see the relationships.
I’m going to go deeper into the container. Now let’s dive into the container and take a look. What we see here are the different data flow tasks within that container.
I can delve into each one of these separately to take a look at the logic and transformations that take place in that specific process. Double-clicking on it will give us a map view.
What we’re going to do now is take a look here. Right now we’re at the inner-system level of lineage, and we’re going to take a look at a specific field, let’s say the product category ID. As soon as I click on it, I can see the journey of that field through its transformations here in orange. I’m going to enlarge that legend and move it up. We can see here the source, transformation, and target, and that is at the inner-system level.
Now let’s say I needed to understand the specific logic and transformations within this process, and I would actually like to see the entire journey end-to-end at the column level. It’s very simple: you click on the field you want to look at, click on the three dots on the right-hand side, and jump into the end-to-end column lineage. If I click on that, I can now see, at the column level, from the minute the data enters the BI landscape all the way to the reporting system, what would be impacted.
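To give a sense of what "understanding the logic and transformations" means under the hood, here is a deliberately naive sketch of inner-system column mapping: it pulls output-column-to-source-expression pairs out of a single, hypothetical view definition. Real lineage tools use a full SQL parser rather than string splitting, and the view, tables, and columns below are invented for illustration:

```python
import re

# A hypothetical view definition, the kind of SQL a lineage tool has to parse
# to map each output column back to its source expression.
VIEW_SQL = """
CREATE VIEW dbo.vw_customer_product AS
SELECT c.CustomerID AS customer_key,
       p.ProductCategoryID AS product_category,
       s.Quantity * s.UnitPrice AS line_total
FROM dbo.Sales s
JOIN dbo.Customer c ON c.CustomerID = s.CustomerID
JOIN dbo.Product  p ON p.ProductID  = s.ProductID
"""

def column_mappings(sql: str):
    """Very naive inner-system lineage: map each output alias to the source
    expression it is computed from."""
    select_body = re.search(r"SELECT(.*?)FROM", sql, re.S | re.I).group(1)
    for item in select_body.split(","):
        expr, _, alias = item.strip().rpartition(" AS ")
        yield alias.strip(), expr.strip()

for alias, expr in column_mappings(VIEW_SQL):
    print(f"{alias:18} <- {expr}")
```

Doing that reliably across every view, stored procedure, and ETL expression in a landscape is exactly the manual work the automated column lineage replaces.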
Now, of course, we can go higher, so if we need to still see the impact from the column level but at a higher perspective from the level of either the table, schema or even database, we can actually see that now as well here within the platform itself.
All right, so what we’ve shown you so far is automated data lineage: impact analysis and root cause analysis at the cross-system and inner-system levels, and end-to-end column-to-column data lineage. Since we have a few minutes, what I’d like to do now is jump into one more use case within our discovery module. Let’s hear the scenario.
The scenario is this. You’ve been notified by IT that one of the operational systems you’re sourcing data from is actually making a change to a column, and your team is responsible for understanding the impact of the change and preparing accordingly in order to streamline this “small change”. BI teams are faced with this challenge on a regular basis, and this is the good case that you’re notified prior to the change. Sometimes you only hear of the change after the fact which, of course, puts more pressure on the whole process and on making the necessary change in the BI environments as soon as possible.
When this happens, it’s likely that you need to reach out to different experts on your team, each one to check the impact of the change within their domain. That’s a lot of time being invested by many different people whose time could be put to much better use, and this is all, of course, before making the changes. Within Octopai, we made the discovery process as easy as possible, searching through all of your different systems with one click.
For instance, for the team that was notified that the column ProductID is being changed in the source system, you can now locate all of its uses within the BI landscape, and it’s very simple. We’re going to type the term ProductID, one word, here in the search area. As soon as I type that in and press enter, we can see the different ETLs using this column as a source, for instance ADF, and of course DataStage and SSIS, in field sources.
Now what we can see here are the databases that contain this column in different forms. Snowflake, for example, has it here in columns. We also see that this column is being used in Snowflake and SQL Server objects such as views, functions, stored procedures, and even triggers. You can also see that the column is not in Oracle; where is it? Over here. We actually don’t even have Oracle here. Okay, the column is also used in SSAS, within attributes and tabular columns, as well as in Business Objects, Power BI, and SSRS, but not in Tableau, as you can see here.
Let’s go back to SQL Server, and what we’re going to do now is click on the objects button. Here we have objects; we have it here 10 times. Once I’ve clicked, we can see that the term ProductID appears in 10 different objects. If we click on the button, we can see the details of each of these objects, including their definitions. By clicking on a definition, we can see the SQL that creates each one of these objects. Let’s click on definition, choose that one, and we can see the SQL here on the right-hand side in the sidebar.
Now, if you would like, we can also click on a map view of it; it’s very simple and gives us a visualization of that object, in this case, the view itself.
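For contrast, the manual equivalent of that discovery step, one database at a time, would be to query the engine's metadata directly. A rough sketch against SQL Server, assuming a pyodbc connection; the connection details are hypothetical, and you would have to repeat something like this for every database, ETL repository, and reporting tool in the landscape:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dwh-server;"
    "DATABASE=SalesDW;Trusted_Connection=yes"  # hypothetical connection
)

# 1) Every table or view exposing a column with this name.
COLUMN_SEARCH = """
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = ?;
"""

# 2) Every view, function, stored procedure, or trigger whose definition
#    mentions the column by name.
DEFINITION_SEARCH = """
SELECT OBJECT_SCHEMA_NAME(m.object_id) AS schema_name,
       OBJECT_NAME(m.object_id)        AS object_name,
       o.type_desc
FROM sys.sql_modules AS m
JOIN sys.objects     AS o ON o.object_id = m.object_id
WHERE m.definition LIKE '%' + ? + '%';
"""

for row in conn.execute(COLUMN_SEARCH, ("ProductID",)):
    print("column:", row.TABLE_SCHEMA, row.TABLE_NAME, row.COLUMN_NAME)

for row in conn.execute(DEFINITION_SEARCH, ("ProductID",)):
    print("definition:", row.schema_name, row.object_name, row.type_desc)
```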
You can see here in just a matter of seconds how Octopai helps you scope the necessary changes and the impact that those changes might have within your landscape. That was basically everything that I had to show you here today in my demonstration. Give me just a second, I’m going to turn that off. I’m going to show you my last slide and open up for questions and answers.
Jim: Thanks, David. Let’s move into our Q&A period now and answer some audience questions. I’m going to start with this one for you, Jim. How can data lineage support my organization’s response to regulatory compliance mandates?
James: It’s very clear how data lineage supports compliance. It lays bare and makes transparent the entire life story of data, from its creation through its use and every processing step in between. If you have strong data lineage capabilities, you’re able to produce end-to-end flow visualizations and documentation that show that you have handled the data, whether it’s health data, other PII, or whatever it is, in full compliance with all the relevant regulatory mandates related to privacy, bias, confidentiality, and the like.
Without strong lineage analysis capabilities, you can’t roll up a complete story, an accurate document to present to the authorities to show that you in fact have used due diligence to maintain compliance.
Jim: David, I’m going to address this next question to you. Can this tool crawl through blob storage and read binary files?
David: Good question. The answer would be no. What we do support are the most common BI tools, such as ETL, data warehouse, analysis, and reporting tools. We also support sprocs, stored procedures, as well as XML, CSV, and other formats, but not what you just mentioned.
Jim: Sticking with you, can you describe the automation process for the metadata? Is it done daily or weekly, and how accurate and up-to-date is the information in the platform?
David: Sure. Great question, and thank you for that. Like I mentioned in my presentation, we actually extract that metadata for you automatically, and we also analyze it for you automatically. The setup process literally takes no more than one to two hours at the very most. All you’re really doing is pointing the Octopai client to the various systems in order to extract the metadata; we provide you with the instructions and where you need to point it. You hit the run button, Octopai goes ahead, extracts that metadata, and saves it in an XML format.
Those XML files are uploaded to the cloud, where they’re analyzed, and within 24 to 48 hours you have a complete analysis of your metadata, providing you with that end-to-end data lineage in the multi-dimensional format.
Now, you can do that on a weekly basis at most. That’s sufficient for most organizations, because most organizations will work on development throughout the week, upload that over the weekend, and Monday morning begin again. If you’re working that way, that works perfectly for Octopai, because we can provide you with an updated analysis every Monday morning.
Jim: Jim, my next question is for you. What’s the difference between data lineage and data governance?
James: Oh yes. Well, data lineage, as I’ve explained today, illuminates data’s full life cycle, the story of data from birth to consumption. Data governance refers to the policies and controls that are applied to data throughout its life cycle, things that strong data lineage tools will illuminate: version control; the matching, merging, and correction of different data sets that are brought together; role-based access controls; permissioning; and so forth. Lineage illuminates governance, or provides a strong set of tools to show that governance is being applied to data throughout its life cycle.
Jim: David, back to you. You mentioned several use cases in your demo, which use case is the most common among your customers?
David: That’s a good question, and very difficult for me to rank, but I would say everything I mentioned covers the types of challenges our customers come to us with: impact analysis, root cause analysis, migrations and upgrades, and being able to provide an audit trail for compliance. Those are usually the challenges they’re facing when they come to us. I couldn’t really tell you which one is the most common.
Jim: Jim, here’s someone seeking some advice. What’s the logical first step in building out a data lineage capability in my organization?
James: Right. Really, the logical first step is to take inventory of your company’s entire data estate, as we often call it, all of your data assets. That first step is a huge undertaking, and so it relies on data discovery tools, data profiling, and the like. It enables you to populate a data catalog with the metadata and the locations of all this data, which should then enable you to map all the dependencies among the different data sets, tables, and so forth that you have.
That first step though is to take inventory of really all of your data assets. Without that, you can’t really have strong lineage capabilities because you don’t know exactly what you have.
David: If I may add to what you just said, Jim, that’s also another perfect use case for Octopai. By connecting Octopai or pointing Octopai to those different systems within the landscape, we can actually do the analysis for you and give you a complete inventory of your data assets automatically.
James: Excellent.
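To make that inventory step concrete, here is a minimal sketch of gathering table and column metadata from a couple of relational sources with SQLAlchemy reflection. The connection strings and source names are hypothetical, and a real inventory would also need to cover files, data lakes, ETL repositories, and reporting tools:

```python
from sqlalchemy import create_engine, inspect

# Hypothetical connection strings for a few databases in the data estate.
SOURCES = {
    "sales_dwh": "mssql+pyodbc://user:pass@dwh-server/SalesDW?driver=ODBC+Driver+17+for+SQL+Server",
    "crm_db":    "postgresql+psycopg2://user:pass@crm-host/crm",
}

inventory = []
for name, url in SOURCES.items():
    engine = create_engine(url)
    insp = inspect(engine)  # SQLAlchemy reflection API
    for schema in insp.get_schema_names():
        for table in insp.get_table_names(schema=schema):
            cols = [c["name"] for c in insp.get_columns(table, schema=schema)]
            inventory.append({"source": name, "schema": schema,
                              "table": table, "columns": cols})

# The resulting inventory is the raw material for a data catalog.
for entry in inventory:
    print(entry["source"], entry["schema"], entry["table"],
          len(entry["columns"]), "columns")
```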
Jim: David, why do you focus on lineage just for business intelligence and not for the entire organization?
David: Well, that’s a great question as well, although that’s what the company was created for; that is the scope of Octopai. We can still provide you with lineage even though we may not be analyzing the other systems the organization is using, such as Salesforce, for example, CRMs or HRMs, and so on.
The way we do that is, as soon as data enters an ETL that we support, which is of course where it begins to be used, and then goes through your landscape or your pipeline, Octopai can capture that lineage all the way from the source to the target, even when the original source is one of those business systems you might be referring to.
Jim: David, sticking with you, can you securely extract metadata to the cloud? What about on-premises technologies? How will you extract that information?
David: Sure, absolutely. Octopai works either in the cloud or on-premises, and it works in a very secure fashion. First of all, we are ISO 27001 certified. Also, we do not access your systems at all; we actually send you an Octopai client. You know what? I thought I might be able to share a slide that addresses that, but actually, I can’t share it.
In any case, the way it works is that we will send you an Octopai client. That client will be installed in your environment. That can be on a laptop. It doesn’t have to be on a server. The person who’s setting this up will then configure Octopai to connect to the different systems where it will be extracting that metadata from. That metadata is extracted and saved in XML format.
Now, there is a pause here: those XML files are saved locally first. Your security team can go ahead and inspect those XML files to ensure that there is no data in them. Only then are they uploaded to the cloud. Of course, it’s not just anywhere on the cloud; we’re using Azure and, of course, their security infrastructure. It’s going to be in a specific customer portal, not a shared environment. Like I mentioned earlier, we are ISO 27001 certified, so with all of that, it is a very secure way of doing the analysis of that metadata.
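As a generic illustration of what a metadata-only extract saved locally as XML could look like in principle (this is not Octopai's actual client or file format; the connection string, schema, and XML structure below are hypothetical):

```python
import xml.etree.ElementTree as ET
from sqlalchemy import create_engine, inspect

# Hypothetical source; only schema metadata is read, never row data.
engine = create_engine("mssql+pyodbc://user:pass@dwh-server/SalesDW"
                       "?driver=ODBC+Driver+17+for+SQL+Server")
insp = inspect(engine)

root = ET.Element("metadata", attrib={"source": "SalesDW"})
for table in insp.get_table_names(schema="dbo"):
    t = ET.SubElement(root, "table", attrib={"schema": "dbo", "name": table})
    for col in insp.get_columns(table, schema="dbo"):
        ET.SubElement(t, "column",
                      attrib={"name": col["name"], "type": str(col["type"])})

# Written locally so a security team can confirm no actual data is included
# before anything is uploaded anywhere.
ET.ElementTree(root).write("salesdw_metadata.xml", encoding="utf-8",
                           xml_declaration=True)
```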
Jim: David, your demo has inspired a whole lot of specific Octopai feature questions. If you don’t mind, I’m going to go through these rather quickly. First of all, does Octopai need access to all your source systems, or just ETL and reporting metadata?
David: Like I mentioned earlier, we can provide you with that lineage from source to target from even all of those different business applications without having to connect to them. The way we do it is we gather that metadata. Once that data enters the BI landscape from the ETL forward, we can then provide you with that lineage from source to target.
Jim: Can Octopai extract metadata from SaaS systems?
David: Yes, absolutely. Whether it’s on the cloud or on-premise is not a challenge for Octopai. Of course, it needs to be one of the systems that we support within the BI environment.
Jim: What ETLs does Octopai support?
David: Very good question. You can see that on our website, but I’ll just bring it up on my computer to share some of them with you. Some of the ETLs are Informatica, Azure Data Factory, stored procedures, Oracle Warehouse Builder, Data Integrator, and IBM DataStage; Talend will be available by the end of the year. Of course, like I mentioned, Informatica and others as well.
Jim: How difficult is it to get the data into Octopai for Power BI? Does it allow for two-factor authentication if that is the only way to get it into our BI reports?
David: Let me just make it clear. Octopai does not analyze any data whatsoever. It will only be metadata. The way we extract the metadata from Power BI is similar, almost identical to everything else that we do. It’s via our platform. You’ll have instructions on how to do that. Do we support two-step authentication? Yes, we do.
Jim: How does Octopai figure out source columns to target column mapping if the ETL code is written in non-SQL code like Python?
David: Good question. Unfortunately, we do not currently support Python. That is in the talks to be supported in the near future, but currently, it’s not supported.
Jim: Can the tool be integrated with data governance tools?
David: It can. We have direct integration with some of the industry leaders, such as Collibra. We also have APIs that can be called to integrate with just about anything else, regardless of whether it’s a data governance platform, a dashboard, or anything else you’d like to integrate Octopai with.
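As a generic illustration of that kind of API-based integration (the endpoint, payload shape, and token below are entirely hypothetical and are not Octopai's or Collibra's actual APIs), pushing lineage edges to a governance platform over REST might look like this:

```python
import requests

# Hypothetical REST endpoint of a governance platform that accepts lineage edges.
GOVERNANCE_URL = "https://governance.example.com/api/v1/lineage"
API_TOKEN = "replace-with-real-token"

lineage_payload = {
    "edges": [
        {"source": {"system": "DWH", "table": "dim_customer", "column": "customer_key"},
         "target": {"system": "BI", "report": "customer_product", "field": "customer"},
         "process": "report_query"},
    ]
}

resp = requests.post(
    GOVERNANCE_URL,
    json=lineage_payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("lineage pushed:", resp.status_code)
```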
Jim: Is Octopai able to trace column values when naming is not the same in multiple source systems?
David: I believe the answer to that question is yes, but I can’t give you a concrete answer right now, so we will answer you via email. We will be getting a list of the questions and those who asked them at the end of the call. We will send you a complete answer via email.
Jim: Does Octopai show the detailed transformation logic that may be embedded within the ETL process?
David: Yes, of course. I showed some of that, or alluded to it, within my demonstration. As you saw in the process, we had source, transformation, and target available, and you can actually see the transformations.
Jim: Your demo showed that using the UI after the data is loaded, but how does it ingest the metadata? Does it support connectors to sync, sort, ADF, [unintelligible 00:54:34] tools automatically?
David: It’s very similar to some of the answers I’ve given. The Octopai client is sent to the customer, and that client is configured once in order to extract the metadata from the various systems. We provide you with the instructions, and that entire process shouldn’t take more than an hour or two.
Jim: Can you search out a logical name instead of a physical name?
David: I don’t know the answer to that. We’ll have to answer you offline.
Jim: Does the tool have the capability to read operational metadata and provide lineage? For example, one generic ETL approach is metadata-driven approach where the content is stored in a file or table and then you run the ETL multiple times with different configurations. Can this tool capture the lineage in this case?
David: I believe that the answer would be yes. The perfect way to figure that out is for you to schedule a call with one of our representatives. If you can actually send us a sample, we can analyze it for you and give you a concrete answer on that.
Jim: Great. Well, if you’re not too tired, here’s some more. Can Octopai provide data lineage across the data lake environment?
David: Currently today, we do not support data lake, but the good news on that is we will start to support data lakes in the very near future. We are in development to support Hive at this point as our first one and we’ll be adding more going forward.
Jim: What are some of the challenges of column-to-column management when working with NoSQL sources?
David: Good question. A bit above my technology level, so we’ll have to answer that offline.
Jim: How does Octopai differ from other data lineage tools?
David: Good question as well. Like I mentioned a few times in my presentation and demonstration, there are many tools out there that do some of what Octopai does, but certainly there are none that do exactly what Octopai does, meaning providing you with that multidimensional lineage: cross-system, inner-system, and end-to-end column lineage, along with data discovery capabilities and an automated data catalog. All of that, of course, is done with a one-hour setup.
Having said all of that, there are tools out there that will provide you with maybe one type of lineage, perhaps inner-system lineage at a high level, but they will not be able to parse the code and the SQL and show you the level of detail that Octopai does, and certainly not with a one-hour setup. Hopefully, that answers your question.
Jim: Well, we’ve come to the end of our hour. Let me take a moment to thank our speakers today. We heard from James Kobielus with TDWI and David Bitton with Octopai. Again, thanks to Octopai for sponsoring today’s webinar.
Please remember that we recorded today’s webinar and we will be emailing you a link to an archived version of the presentation. Feel free to share that with colleagues. Don’t forget, if you’d like a copy of today’s presentation, use the Click Here For PDF line.
Finally, I want to remind you that TDWI offers a wealth of information, including the latest research, reports, and webinars about BI, data warehousing, and a host of related topics, so I encourage you to tap into that expertise at tdwi.org.
With that, from all of us here today, let me say thank you very much for attending. This concludes today’s event.