Why Automated Data Lineage is a Must-Have for BI & Analytics


Where did the data in this report come from? Are you sure this report is accurate? If you’ve been working in BI for longer than five minutes, you probably get these questions all the time. Watch the replay of our webinar with Philip Russom of TDWI, who digs into the challenges BI & Analytics professionals face every day when it comes to answering these questions. It all boils down to being able to find, understand, and trust your data. But how?

Video Transcript

Jim Powell: Hello, everyone. Welcome to the TDWI Webinar Program. I’m Jim Powell, editorial director at TDWI, and I’ll be your moderator. For today’s program, we’re going to talk about why automated data lineage is a must-have for BI and analytics. Our sponsor is Octopai. For our presentations, today we will first hear from Philip Russom with TDWI. After Philip speaks, we have a presentation from Mark Horseman from NAIT and Amnon Drori from Octopai. Before I turn over time to our speakers, however, I’d like to go over a few basics. Today’s webinar will be about an hour long.

At the end of their presentations, the speakers will host a question and answer period. If at any time during the presentations you’d like to submit a question, just use the Ask a Question area on your screen to type in your question and send it over. If you have any technical difficulties during the webinar, click on the help area located below the slide window and you’ll receive technical assistance. If you’d like to discuss this webinar on Twitter with fellow attendees, just include the hashtag TDWI in your tweets. Finally, if you’d like a copy of today’s presentation, use the “Click here for a PDF” link on the left middle of your console.

In addition, we are recording today’s event and we will be emailing you a link to an archived version so you can view the presentation again later if you choose, or share it with a colleague. Okay, again, today we’re going to be talking about why automated data lineage is a must-have for BI and analytics. Our first speaker today is Philip Russom, Senior Research Director for Data Management at TDWI. Philip is a well-known figure in data warehousing, integration, and quality, having published over 600 research reports, magazine articles, opinion columns, and speeches over a 20-year period.

Before joining TDWI in 2005, Philip was an industry analyst covering the data management industry at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and consultant, was a contributing editor with leading IT magazines, and was a product manager at database vendors. His Ph.D. is from Yale. Philip, I’ll turn things over to you now.

Philip Russom: Well, thank you, Jim, for the very nice introduction. Also, my special thanks to everybody listening to us today. We know you’re very busy, and we all deeply appreciate that you could find time to join us today. As Jim was explaining, we’re going to talk about data lineage from different viewpoints. Let me get things organized, at least for my part of the webinar. I like to begin a webinar by telling people the takeaways, in other words, ideas that I hope you’ll remember, and take action on when you leave the webinar. For example, today, we’re going to talk about how business users won’t trust and use the data that you provision for them unless they know its origins and transformations.

I’ll eventually explain why that is and what you can do about it. One thing you can do about it is get data lineage tool functionality, so you have accurate information about data when you are questioned by users, developers, auditors, data governors, data managers, business managers, et cetera. Data lineage is also useful in other use cases, say for many scenarios in data-driven development, reverse engineering, compliance relative to data, self-service data access, data migrations, and so on. Data lineage addresses all the above plus many other issues; we’ll go through a lot of those for you today.

Also, I would add that there are many compelling business and technical use cases for data lineage. One of the things we’ll highlight is the special role of automation in advanced forms of data lineage. I’ll summarize all this at the end of my presentation as well. Right now, it’s time for a poll, so put your hands on your mouse and get ready to click your response here. In our first poll, the question is, do you agree that enhancing the capabilities of the BI team will increase business users’ trust in their data? If you would, click on one of the four possible answers here, and I’ll give you a moment to–

Also, after you click, be sure you click the Submit button. I’ll give you a few seconds to get your answers in there. [singing] Okay, let’s move on. Let’s see what kind of responses we got out of this. Here we go. The big long blue bar at the top shows that most of you listening today, 60% of you, agree that enhancing the capabilities of the BI team will increase business users’ trust in their data. I think this is true, whether you’re collecting data lineage information specifically or doing other improvements, and more of you agree at various levels.

Yes, I think we’re on the same page here, we all see that there’s a need for these kinds of improvements. Now, I did mention that the users have questions. Well, let’s start there, what are some of these questions? Why would they ask, why do they need answers? We have some common and important questions concerning business intelligence, advanced analytics, data warehousing, and so forth. I got to say, the most common question I’ve ever heard is, where did the data in this report come from? People want to know this before they make an important decision, right? Otherwise, they don’t really know where the data came from.

Every data source has some bias to it, for example, customer data coming from a sales pipeline application is actually quite different from customer data coming out of say, a call center application, right? People want to know, where does the data come from? Also, how did you aggregate and transform it? The mix of data sources will actually have a bias, have an influence on the state of the data as well. People want to know this stuff before they make important decisions about it.

There are other important questions as well, for example, who has used this data? If, say, you’re using a self-service data access tool and you can see that your colleagues have been using certain data sets, that’s actually positive guidance for you; it can tell you that your colleagues are getting value out of certain data sets, so maybe you should look at them yourself, this kind of thing. Who’s used the data is important. In a different scenario, data governors can look at who’s using the data and spot inappropriate users, that kind of thing.

Also, people want to know, what’s the quality of this data? That’s part of a trust value they’re going to assign to the data. Also, if they have quality problems with the data, this is going to affect the downstream work they’re going to have to do to fix those problems. It’s a project planning thing as well. Overall, all the above and other questions go together in determining the level of trustworthiness of the data in question. If you want a solution that gives you quickly available answers, and answers that are credible, that make sense to people, then data lineage is an approach that does this.

Data lineage tracks and records the journey that data takes from primary and secondary sources, through a variety of interfaces, and also how it got transformed to repurpose the data for specific use cases and user groups. Data lineage tells you how data was aggregated into target databases. In my world, the most common target databases for this would be a data warehouse or data marts, but also newer stuff like data lakes. Then finally, in what delivery mechanisms has this data been used? Things like BI products, reports, analyses, dashboards, and so forth.

Also, data lineage can record other attributes of data, for example, what data domains are these? Subject areas, other categories? What’s the condition of the data’s quality, its metadata, its modeling? What’s the vintage? How old is this stuff? Is it fresh enough for me? Finally, in multiple use cases, it can also list things like what use cases and apps the data got used in. Then finally, we’re seeing a lot of data lineage tools that automatically generate a data map from all the above. That’s a great thing, because you don’t want to put the map together yourself; that could be time-consuming. Automation for that map is a big deal, and we’ll come back to automation in a little bit.
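To make that data map concrete: it is essentially a graph of assets and the edges between them. Here is a minimal sketch of what such a map might hold, in Python; the asset names and kinds are invented for illustration and not taken from any particular vendor’s model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Asset:
    """One node in the data map: a source, a warehouse table, or a report."""
    name: str
    kind: str  # e.g. "source", "warehouse_table", "report"

class LineageMap:
    """A minimal lineage graph: edges run from upstream assets to downstream ones."""
    def __init__(self) -> None:
        self.downstream: dict[Asset, set[Asset]] = {}

    def record(self, upstream: Asset, downstream: Asset) -> None:
        """Record one hop of the data's journey."""
        self.downstream.setdefault(upstream, set()).add(downstream)

# Hypothetical mini-map: a CRM table feeds a warehouse dimension, which feeds a report.
crm = Asset("crm.customers", "source")
dim = Asset("dw.dim_customer", "warehouse_table")
rpt = Asset("bi.revenue_report", "report")

lineage = LineageMap()
lineage.record(crm, dim)
lineage.record(dim, rpt)
```

An automated lineage tool would populate a structure like this by scanning ETL jobs and report definitions rather than by hand, which is exactly the point Philip makes about automation.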

Why do you want to have data lineage? Well, it’s really the negative ramifications of not having it. If you don’t have credible answers through the data lineage information you’ve collected, users may not use the products; they won’t use the data. One of your metrics for success as a data professional is the question, are people actually adopting and using the data sets, the reports, the analyses that we put out there? You could be considered a failure if adoption does not work out among users. Again, if they don’t have the trust, they won’t do the adoption. Also, a problem is that some users will compensate for the lack of data lineage information by creating their own data. That way, they at least know where the data came from, even if their low-end data sets aren’t the best.

The scariest example of that would be rogue data marts. Data lineage information helps you to avoid that rogue behavior. Then finally, without broad data lineage information, a lot of tasks are slower and less accurate. For example, developers will take longer, and developers may select inappropriate data sources. In other areas, say it’s more of a self-service scenario; that kind of user needs all the guidance they can get, and information about data lineage provides very valuable guidance for them. Then finally, I personally have lived through audits. They’re pretty scary. I know from experience, you want the audit to go as quickly and easily as possible. The longer the auditors linger, the more dirt they find.

You really want data lineage and other documentation to help you speed audits, no matter who’s auditing you, whether it’s your own management or a regulatory agency or something like that. Hey, it’s time for another poll, so get your hand on your mouse and get ready to respond. The question this time is, when a business user asks where a particular data item came from, how confident do you feel in your ability to provide an accurate answer in a timely fashion? You can see four different answers. Please click on one, and then click on the Submit button.

I’ll give you a moment to make your choices. [singing] Okay. Let’s look at the responses. Oh, good. If you’re seeing what I’m seeing, there’s a big green bar. Most of you selected, “I’m pretty confident that I can get the answers and do it in a timely fashion.” Almost none of you have no confidence at all, so that’s great. It’s a positive thing that you can get the answers, but part of our point is, if you have the automation provided by data lineage tool functionality, you can do this stuff even faster with greater accuracy. In a lot of cases, the users don’t even need you as a data professional.

They can quite often go to the lineage tool themselves and get this information in a self-service fashion. Let’s move on to some business use cases. I think data lineage increases the business value from a wide variety of analytics investments, as well as raising adoption rates. Also, data lineage can be a real boost to any kind of program that’s tracking compliance, governance, stewardship, data curation, or compliance with very specific regulations like GDPR, CCPA, and so forth. For a lot of people, data lineage is actually the proof. They can say, “Look, our data lineage tool tracks data usage. Based on that information, we know we have achieved compliance.”

In some cases, you have to prove that you are complying. I mentioned audits earlier, so that is a business problem. Any help you can get to speed an audit is a great thing. Finally, data lineage helps to guide self-service users. From a technology viewpoint, there are also compelling use cases. There’s guidance for developers, and also a productivity boost for developers. They’ve just got more information, so they don’t have to create and recreate that information themselves, which can be time-consuming. One thing I’ll dwell on here: I don’t think we’ve yet mentioned that a lot of times, you have to reverse engineer something somebody else built.

Sometimes it’s that rogue data mart I mentioned, and you get stuck as a data professional trying to figure out, “Where does the data in this thing come from, and how do we fix it?” That sort of thing. In other cases, it’s a very sophisticated solution that you inherited from a prior technology colleague. You’ve inherited it, so you have to figure it out. The more information you have about data, its lineage, its use cases, et cetera, the faster reverse engineering will go. Also, data lineage can help you with continuous improvement, whether that’s a data quality issue, or say finding redundant data sets and merging them, or finding data that’s just not being used and maybe should be taken offline, or at least moved to a cheaper form of storage, and so forth.

Finally, from a technology viewpoint, everything we’re talking about today adds up to a heck of a list, doesn’t it? You need automation instead of having to do a lot of this stuff in a manual and time-consuming way. With that in mind, let’s look directly at automation. With advanced tools for lineage, I see many of them can scan data automatically. When I say scan data, I mean they can reach out to a wide variety of data sources and targets across an entire enterprise. What the tool typically does is create and maintain a broad data map.

This data map can be extremely useful for a wide variety of stuff, not just data lineage. The automatic lineage map saves you a heck of a lot of time as a developer, right? This stuff is automatically created for you; I’ve talked to users who’ve done this thing in a couple of hours, whereas you could easily put man-days into this kind of work. The lineage-driven data map gives further automation in other areas. It makes reverse engineering even faster. A tedious part of your job quite often is doing mappings from sources to targets. When you have this map, it just makes the drag-and-drop process of doing those a lot better. Having this full information about data also gives you the ability to do impact analysis.

That way you can understand the ramifications of changes before making those changes. Finally, the data lineage map provides a view of all data sets. It gives you a pretty unique big picture, and that’s what the next slide is about. Imagine the big picture of enterprise-scale data as seen through a data lineage map. It’s really showing you an inventory of data to be queried, browsed, searched, and used in a wide variety of ways. The map also visualizes data structures. It helps you understand, what are the bigger architectural patterns going on here? What are the dependencies among independent data sets?
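Impact analysis on a map like this reduces to reachability: everything downstream of the asset you intend to change is potentially affected. A minimal sketch, assuming the map is held as a simple adjacency dictionary and using hypothetical asset names:

```python
from collections import deque

def downstream_impact(edges: dict[str, set[str]], changed: str) -> set[str]:
    """Breadth-first walk of the lineage graph; every asset reachable
    from the changed asset is potentially affected by the change."""
    affected, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, set()):
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected

# Hypothetical mini-map: one source feeding two reports through a warehouse table.
edges = {
    "crm.customers": {"dw.dim_customer"},
    "dw.dim_customer": {"bi.sales_report", "bi.churn_dashboard"},
}
print(downstream_impact(edges, "crm.customers"))
# -> {'dw.dim_customer', 'bi.sales_report', 'bi.churn_dashboard'}
```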

The data lineage map helps you develop an excellent understanding of very complex data environments. For a lot of our members at TDWI, these complex environments are for data warehousing, data lakes, and analytics programs. There are also operational environments that have complex data you need to understand as well. The big ones there would be marketing, especially where you’re doing very complex digital marketing, online campaigns, and multichannel marketing. The fully modernized supply chain involves a heck of a lot of data from many sources and in many structures.

Sometimes it’s messages, other times it’s documents like XML, JSON, et cetera. Financial data is equally complex. Furthermore, the data lineage map shows you the current data platforms, and also the new ones, in one view. You can see the old platforms and the new ones. This is really important if you’ve been doing upgrades and so have a lot of new systems, in particular if you’re doing a lot of cloud migrations or you’re adopting software-as-a-service applications. The ability to see on-premises and cloud data in one place can really give you a much deeper understanding of your current inventory, as well as the architecture and other design patterns laid across them.

In my summary here at the end, let’s talk about why automated data lineage is a must-have for BI and analytics. Remember, that’s what the title of this webinar claims. I think one of the reasons it’s a must-have is that without credible data lineage information, users will not trust and use your data. In response, some of them may go off and create rogue data sets, or use data in non-compliant ways. When you do have broad data lineage information, many tasks are accelerated. Things like data-driven development, like creating new data sets, can go faster with a lot more accuracy, be fully documented, and therefore be easily maintained in the future.

Self-service environments have better information to guide the kind of person that’s doing self-service, audits go faster, and so forth. Data lineage supports compelling business use cases. It gives you BI that’s trusted, used, and accurate. Also, it’s a very good thing to have for compliance, audits, and self-service. Data lineage supports compelling technology use cases. It accelerates development, prototyping, and reverse engineering. Then finally, data lineage is best when it’s automated via that data map I was talking about. You want a map that’s automatically generated, but of course, you might want to manually tweak it.

Some of you may already have some information that could be uploaded to expand and contribute to the map. This data lineage map can help you with things like mapping from sources to targets, doing impact analysis, and just getting the big picture of your enterprise data inventory. All right. Well, Jim, that brings me to the end of my presentation here. Can you segue us to the next speaker, please?

Jim: Thank you, Philip. Just a quick reminder to our audience. If you have a question, you can enter it at any time in the Ask a Question window. We’ll be answering audience questions in the final portion of our program. Our next speaker is Mark Horseman from NAIT, the Northern Alberta Institute of Technology. Mark is an IT professional with over 20 years of experience. Mark moved into data quality, master data management, and data governance early in his career, and has been working extensively in BI since 2005. Mark is currently leading an information management initiative at NAIT. Welcome, Mark.

Mark Horseman: Thank you for that excellent introduction. It’s really exciting to be on the call today, sharing our lineage journey with everybody at this TDWI webinar. Thank you for having us. NAIT is a polytechnic in northern Alberta, Canada. Our chief mission is a promise to students and industry, a promise to our own staff, and a promise to the province. We are essential to Alberta. I’m going to go over how we arrived at lineage being an important aspect of our data management program, and then get into some of the details of the benefits we’ve realized. A short while ago, NAIT brought me over to kick off a data management initiative.

They had some struggles and some issues, and really wanted to bring clarity and trust to the institutional data assets. The first thing that I did when I arrived at NAIT was drink a heck of a lot of coffee, talking with all of the stewards, all of the folks in various business units at our institution (we have about 2,500 staff), making sure we understood the challenges that people were facing out in the wild. That is one of the most critical things that we did early on in beginning our data management journey. It’s important to understand every issue that people face as it relates to consuming data as part of a decision-making process. We need to understand the issues that they have with data.

As Philip was saying, we heard trust, trust, trust. I don’t trust the data. I don’t trust this number. As you can imagine, when making decisions on how many instructors we need to teach and how many students we have, trust is an important factor in making sure that we deliver a quality product and achieve our promises to our staff, our students, our province, and our industry partners. With that, we started a catalog of the issues people faced. Where’s the trust breaking down? What are people doing out in the wild in the various business units at our institution? Then we built what’s called a common data matrix. This is a data management artifact that we can build out.

That helps us understand the key players of data at our institution: who is responsible for the production, creation, and usage of information, just at a high level, so that we can build out how all that works. As we build that data management practice out, we need a focus for it. Based on what you’re hearing as issues, you can have a lot of different focuses for any kind of data management initiative. People often focus on cybersecurity, or purely on business intelligence, or on data quality. Given that trust was a huge factor in what we heard, we felt that it was absolutely critical that we focused on data quality.

That means that the artifacts and the nature of our data management initiative are going to focus on people believing the numbers that they see, people having trust in the accuracy of the information that they’re using on their reports. Once that’s all understood and built out, we need to make sure we align to the business. Any data management initiative that you undertake needs to align to something that the business intensely cares about. As you can imagine, in higher education, financial sustainability is of huge importance right now. Our ability to work very closely with the financial sustainability team at NAIT really gave us the lifeblood that we needed to build that trust and build those relationships.

As we did that, we defined a process where we would put a stamp on a report, a grade on all the BI artifacts that people consumed. That grade was based on: do we have the business rules and definitions understood? Are the reports using the architecture that’s appropriate? And most importantly, data lineage. Where did everything come from? Do we have that documented? Can we confidently say what happened to the data before it showed up on a report? One of the key figures we use in Alberta is called full load equivalent: how much of a full load of classes is a student taking in the context of an academic year?

Being able to calculate that number, while it may sound easy, involves a lot of complexity. Being able to explain and show the lineage of that is critical for people to make decisions. Just to dovetail off what Philip was saying earlier: that trust was at a low level when I started the journey here at NAIT, and when you don’t have that trust, you get rogue data sets out there. We had a lot of different pockets within various business units at NAIT that were maintaining their own data sets for decision making because that trust level wasn’t there.

When I was looking at this and seeing this happen, and I’ve done data management for a number of years now from a lot of different angles, what I really noticed was that this started with good intent, they needed to do something, but then it turned into this big nagging monster that was unmaintainable, whose quality was suspect, with people spending huge amounts of time to maintain this data set just so they could make the decisions that are critical for the success of their business. What we needed to do was look at why that started and look at bringing them back to a centralized, trusted institutional data source. To bring them to that data source, we needed to build that trust, and data lineage is critical to building that trust.

We knew that there would be a lot of ROI in automated lineage. When looking at lineage and when NAIT brought me in, I was a team of one. [laughs] You can imagine the vast scope of looking at our entire BI environment, our entire analytics environment, and thinking to myself how on earth is it even possible for me to document the lineage of all these things that people use. How is this even achievable? It’s so huge. We have thousands of reports. We have many enterprise systems. We have a fairly complex data warehouse environment. How is this all going to be even possible? It was immediately apparent that the only way we could achieve this and actually bring value to people was to automate it.

Great. We’ve automated it and we’re looking at an automated solution. How do we know the business is going to interact with that data? Again, drinking a heck of a lot more coffee. Whew, I got a lot of Starbucks points, that’s for sure. [laughs] When we’re out there talking with various consumers, we’ve got various business users that are creating their own data sets or using our data sets, and then we can show them, “Hey, this is a solution we’re looking at. This is what it’s going to be able to do for you. How are you going to be able to interact with that data?”

We heard from various stakeholders at our campus that that would be huge for them. They wouldn’t have to spend weeks and weeks trying to figure out where everything came from, or how this happened, or how this number changed, or why this number changed; we could have a key resource to help disentangle the complexity of our BI environment. Again, we engaged with the business users and determined that there was a lot of ROI to be had.

Speaking of ROI, I’ve just got a couple of specific cases here, and I might just launch into storytime. [laughs] What we have actually seen since implementing an automated solution is that it’s very helpful for local BI development. It’s a great way for me to have an eagle-eye view of the entire architecture of our system, especially as it changes, and we’ve seen usage outside of our technical team. We’re seeing usage out in business units, and we’re seeing usage in our institutional research group. That’s a group that asks hard questions and really interrogates data. There are a number of things that we’ve done with automated lineage. I’d like to talk about these three in particular.

One of the challenges we were faced with was creating an integration to one of our enterprise systems. A number of our integrations where we’re passing customer data from one system to another come through our warehouse environment. This gives us more of a hub and spoke type architecture instead of having one enterprise system ham-fistedly mash data into another enterprise system. It gives us a lot more control, especially over the quality of the data that goes into that. Many, many moons ago, I would guess in 2003, 2004, 2005 somewhere around there, we created an export of our alumni to go into our alumni system.

As you can imagine, as a post-secondary institution, donations from our alumni are very important to us, and being able to understand who has graduated and become NAIT alumni is critical to making that work. It was important for us to look at how the old integration worked to start building a new integration. This had been a black box for many years. People had struggled with managing it. In the old implementation, we had a whole set of data streaming out of one system, going into our warehouse, streaming out of our warehouse, going into an Access database, of all places, and then getting mashed up in that Access database to go into another enterprise system at the end.

All of those steps were, whew, oh boy. With automated data lineage, we could see what was happening. We could see the exact data that was going out. We could see where it came from. We saw that in minutes; it took us like half an hour to say, “We can do this. We know where the data’s coming from.” For a task that before would’ve taken us weeks of analysis, we had an answer at our fingertips, because it was automatically documented by a lineage solution. That’s one thing, working with the new integration. We’ve got a new incoming president, which is very exciting for us, with lots of great new ideas and a new strategic direction for our institution, but part of that requires an ability to trust data.

One of the tasks that we were faced with was the development of a storyboard. What is the story of our student body? How are students interacting with our institution? How does the relationship to the institution change over time? How do they progress? How well are they doing in classes? Do we need to help students ensure that they graduate? Those are all excellent questions which demand an accuracy of data and trust in the source of the data. As we’re developing this storyboard, we’re ensuring that that storyboard is using trusted data sets and we can trust those data sets because they’re automatically documented in our lineage tool.

There is no weird spreadsheet or massive SharePoint site or anything like that, that we have to maintain. It’s all kept in this storyboard and in the automated tool. That has created a lot of trust and a lot of usage of that new storyboard item. The last thing, this is very recent, as you can imagine, as schools are reopening, because we’re a polytechnic and we do a lot of trades education, a lot of our instruction for those types of programs has to be in person. We have welding labs, we have other types of labs, we have health labs, so people learning how to use ventilators. A lot of activity is happening in person on campus.

There are a number of rules that we have related to relaunching during COVID. One of those is submitting a form every day saying, “I’m going to campus today, and I agree that I don’t have any symptoms of COVID.” There’s a reporting requirement related to that activity. It’s great. We can produce a report, but now we can also produce trust that that report is getting its content from the right source. If somebody questions, “Hey, why isn’t this showing up on the report? I know this person submitted a form,” we can trace back to the source data very easily, and we didn’t have to document the heck out of everything along the way.

Those are my favorite stories. We have many more stories of success that we’ve had with automated lineage. It’s been a huge boon for us. We save anywhere between three and five FTEs from month to month; the sheer ease with which we can get what used to be complex answers across the institution is just huge for us. With that, I’ll pass it over to Amnon.

Amnon Drori: Hi everyone. Thank you very much, Mark, for sharing your story. I think that it’s really, really interesting to see how organizations are choosing to become more sophisticated and more equipped with available technologies in order to deal with data management, specifically around BI. It was really insightful, Philip, for you to share many, many things about the need for automation. I think that listening to both sessions, one thing is very clear: the complexity around data management, the understanding of data in order to hand it over in a trusted way (a word that was repeated a couple of times today) to the business user, is something that is really critical.

My name is Amnon Drori. I’m the co-founder and CEO of a company called Octopai. Our story is that in the past five years, we’ve looked at what’s going on around data, the three Vs of data: volume, variety, and velocity. We’ve seen very crazy things happening around the need for more data, and the growing need among business users who want more data constantly, all the time. The problem was that we, coming from leading BI groups in insurance and banking and healthcare, reached a tipping point where we felt that we could not do our job properly. We came to the conclusion that adding more and more people would not catch up with the business demand.

I think that even what Mark said, saving five FTEs, means in other circumstances that the team is now acting as if it has an additional five FTEs. The need for efficiency, the need to become better, is something that we focus our efforts on, and we help BI organizations become better. What I’d like to use my time for is to show you how this really looks in practice. I’m going to use our demo environment to show you something that kind of mimics NAIT’s lineage capability.

Before we do that, I want to continue what Philip has been using very, very nicely, which is a very quick poll. Take a couple of seconds just to look at the question. The question, which is really interesting to me (and I’m going to show you results not only from this poll), is this: if you need to make a change to a field in the ETL, or in one of the maps, or within the database, how long would it typically take your data team to conduct the impact analysis, just to check what is going to be impacted by whatever you are about to change?

Is that going to take you minutes? Is that going to take you hours? Is that going to take you weeks or is that going to take you months? What do you think?

All right. We can move on to the results; I’m going to click on that. I can share with you that you’re a very, very good crowd: it takes anything from hours to a couple of weeks, depending on the need and depending on the use case. In some areas it even takes months, but what’s notable here is that it doesn’t take a few minutes. This is very much aligned with other polls we’ve done in other locations, simply because the BI team keeps being surprised. If I were to ask you, what is the use case for which you would need lineage next week, or tomorrow, or two weeks from now?

You don’t really know. The need to address business needs happens on the spot. When you need to plan a certain change, for example, because you want to enrich a certain report with additional data and transmit it from the data sources to the data target, this could be a task that pops up next week. You can never be prepared in a timely fashion to respond, and this is where the pressure actually starts. Going back to what we decided to do a couple of years back: we decided to challenge the status quo. We decided to look at how business intelligence landscapes are being managed, which is fairly manual and fairly silo-based.

A lot of professional services involved, a lot of it on-prem, and we said, “What if we could do this differently? What if we think out of the box and do things differently and see what the benefits are?” One thing that came to mind is: what if we did not need to change or manage each individual business intelligence tool? Whatever you have, whether it’s an ETL or a database or a data warehouse or reporting and analytics, and, to make it even more spicy, each one of them may be from a different BI vendor. What if I could take all of that? Whether you have one ETL, two databases, and one reporting tool, or, in some cases that we’ve experienced, 25, 30, 40, 50 different systems, you want to crunch them into one central repository.

If you can do that, you can also do a very thorough analysis of that centralized metadata. Once we do, we create a kind of Octopai language that understands the metadata from all of these different vendors’ tools, and we leverage the power of the cloud. The outcome of that analysis is shrink-wrapped into different products, some of which you can see right here. The data lineage is something that is really, really valuable, but the set of capabilities helps organizations in a lot of use cases, some of which you see here. Also, these are the typical people who are the direct beneficiaries of using automation in their BI, specifically around data lineage, to help them with those different sets of capabilities.
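Mechanically, crunching metadata from many vendors’ tools into one repository amounts to normalizing each tool’s export into a common model before analysis. The sketch below illustrates that step; the field names and the two tool-export formats are invented for illustration and are not Octopai’s actual schema.

```python
def normalize_ssis(row: dict) -> dict:
    """Map a hypothetical SSIS-style metadata record onto the common model."""
    return {"tool": "SSIS", "source": row["Input"],
            "target": row["Output"], "process": row["PackageName"]}

def normalize_datastage(row: dict) -> dict:
    """Map a hypothetical DataStage-style record onto the same model."""
    return {"tool": "DataStage", "source": row["src"],
            "target": row["tgt"], "process": row["job"]}

NORMALIZERS = {"ssis": normalize_ssis, "datastage": normalize_datastage}

def centralize(exports: list[tuple[str, dict]]) -> list[dict]:
    """One repository, one schema, regardless of which vendor produced the metadata."""
    return [NORMALIZERS[tool](row) for tool, row in exports]

# Two exports from two different tools land in one uniform repository.
rows = centralize([
    ("ssis", {"Input": "crm.customers", "Output": "dw.dim_customer",
              "PackageName": "LoadDimCustomer"}),
    ("datastage", {"src": "erp.orders", "tgt": "dw.fact_orders", "job": "LD_ORDERS"}),
])
```

Once every tool’s metadata is in one schema, cross-platform lineage becomes a single graph problem rather than a per-tool investigation.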

If I were to say what is very, very unique, what has changed from five, six, seven years ago, it is the fact that our approach says: don’t bother with your individual silo tools. Have everything within your platform analyzed, cross-platform; that’s number one. Number two, if you want to enjoy those capabilities the traditional way, there’s a price for it, mainly manual work, customization, understanding the documentation in place, and sharing the knowledge between the teams. What we say is: invest 30 to 45 minutes of your time, extract metadata from those tools with a free service from Octopai, and then allow us to analyze it.

If you have 5 tools or 20 tools, it doesn’t matter. Within a day, you’re going to get access to start looking at what’s going on within your BI, whether you use our data lineage, BI catalog, discovery, and so on and so forth. Let’s have a product that, for God’s sake, is going to be very, very easy to operate. I want to show you a couple of cool things within our product. I’m going to go ahead and share my screen. I’m going to use the next few minutes to show you how it looks. I’m going to share my screen right now. You should be able to see three round circles that look like this.

What you see here is a centralized repository in Octopai that consists of about 389 ETL processes that ship data into 2,500 database tables and views, and there are 24 reports consuming the data stored in those 2,500 database tables and views. Now, you can see a bunch of different tools from different vendors, like DataStage and SSIS and SQL Server and Cognos and what have you. Let’s demonstrate some of the things that have been talked about today. Let’s assume that your business user is saying, “Something is missing in my report,” or, “Data doesn’t make sense to me,” or, “Why am I looking at this data report and it’s half empty? Why am I looking at two different reports with the same name but seeing two different data set results?”

Here’s how you do it. In BI language, the business user’s question means: first of all, I need to find where the report is being generated among the set of different reporting tools. The second thing is how it’s structured and, most importantly, where the data is coming from. Where does the data that has to do with that specific report reside within the 2,500 database tables and views, and which ETL processes, exactly and collectively, are responsible for landing the data in that report? That’s the BI language. This is how you do it. You go to the reporting section here; again, this is metadata that has been extracted from all of these systems, centralized, and analyzed.

I’m going to look for a report, let’s say customer products, and as I’m typing, Octopai is searching for that report for me. Again, I’m the BI professional here. Now, the only thing I need to do is click this Lineage button, which will guide me to the exact tables, views, and ETL processes that are involved in landing the data on that specific report. Clicking on that, within three seconds, this is the view that you get. What you see here is that this report is actually based on a view, this view is based on data that resides in the tables and views right here, colored in red, and the data lands in these database tables and views by running these ETL processes.
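The backward walk demonstrated here, from a report to the views, tables, and ETL processes behind it, is the mirror image of the impact analysis sketched earlier: follow the lineage edges upstream instead of downstream. A minimal sketch with invented asset names:

```python
def upstream_lineage(edges: dict[str, set[str]], report: str) -> set[str]:
    """Walk the lineage graph against the arrows: everything the report depends on."""
    parents: dict[str, set[str]] = {}  # reversed index: downstream -> upstream assets
    for up, downs in edges.items():
        for down in downs:
            parents.setdefault(down, set()).add(up)
    sources, stack = set(), [report]
    while stack:
        node = stack.pop()
        for up in parents.get(node, set()):
            if up not in sources:
                sources.add(up)
                stack.append(up)
    return sources

# Hypothetical chain: an ETL loads a view, and the report reads from that view.
edges = {
    "etl.load_customer_products": {"dw.customer_products_view"},
    "dw.customer_products_view": {"rpt.customer_products"},
}
print(upstream_lineage(edges, "rpt.customer_products"))
# -> {'dw.customer_products_view', 'etl.load_customer_products'}
```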

First of all, for the first time, you can see different sets of tools and different data processes on one screen. But let’s play around with it. In some cases, this ETL could be the source ETL for that table, but is this really the ETL that captures metadata or data from the applications themselves? Octopai will tell you that the answer is no. Actually, if you continue the navigation, you will find the tables that are sources to that ETL, and if that is the table that is a source to that ETL, are there any additional ETLs that run prior to landing the data in that table?

Clicking on that, you can continue the journey as far as you want, up to the source table of the application. You can have everything at once, you can control the flow, and you can do it the other way around. What if you want to see the impact analysis of changing this ETL moving forward? I want to do the lineage forward, not from the report backwards but from the ETL forward. Now you can see a different map. This picture says that any change here may impact not only the report we started this journey from; it will impact all of these reports. And guess what?

This one is an SSRS report and this one, the sales report, is a Business Objects report. The message is that all your BI is analyzed and in the palm of your hand, so you can understand what’s going on inside. The last thing I want to move on to, before I hand it over to Q&A, is the last poll on my part, asking you the following question, if you can share: is automation the key to taking your BI operation to the next level? We started this session with this theme, and after this session, it will be really interesting to understand your view on the topic. Take another few seconds and share your thoughts.

Great. I’m going to move forward to the results. As you can see, 35% say “Absolutely” and about 37% say, “Yes, I think so.” The majority is actually in favour of automation. I would like to invite everybody, if you would like to hear more about the product, its capabilities, the use cases, and why others have chosen to become better by leveraging automation, to feel free to contact Octopai. We’re at your service. We’ll be more than happy to conduct a 15 to 20 minute demo, or even discuss your challenges and see if we have any fit for your needs. With that said, we’re open to Q&A, I guess.

Jim: Thank you Amnon. Let’s move into our Q&A period now and answer some audience questions. We’ll start with this one for the group. Is it necessary to do data lineage prior to data governance?

Philip: Yes. Let me step in on that one, Jim. That’s a great question, and actually, this general question involving data governance has come up quite a bit. Real quick background: at TDWI, we’ve seen a lot of organizations be very successful. [chuckles] In fact, we’ve given some organizations prizes for creating a data governance program before other data management programs, especially those that involve lots of change, because change has to be managed and governance support is good for that. For example, you typically want to create a governance board before you attempt a large data quality program or a master data management program.

You know, to be honest, I’m not sure about the order when it comes to governance and data lineage, but I would guess that any time you begin a new kind of data management practice, and lineage will be new to many of you, that it does help to have governance in the beginning. Then my final thought here is all data has to be governed, period. All new data management practices need to be involved with governance, period. Listen, Mark or Amnon, do you have any thoughts to add to this?

Mark: You bet–

Amnon: If I can add– Sorry. Go ahead, Mark.

Mark: Thanks. Yes, really to echo what Philip said, for us at NAIT, understanding our data governance group and building that council of information stewards, we called them, really was a benefit for us to understand what the ROI would be before implementing lineage. Lineage is a component of a data management program or a data governance thing and having those two things work together is really critical. Yes, understand who your stewards are, having that council available to help launch your program and understand your ROI makes a lot of sense. Amnon?

Amnon: Right. Yes, I can share that it very much depends on the use case. I agree with you completely that data lineage is an important part of data management. Nevertheless, we have clients that either did not implement data governance, or are on the journey of considering it, while their BI needs to be, or wants to be, sophisticated tomorrow. That’s the power of this automation. If you want to deal with use cases, you can start automating your data lineage while you consider or explore data governance, and then it will be incorporated with that.

Mark: We don’t have our cameras on so you can’t see me nodding profusely in agreement with that.

Amnon: Yes.

Jim: Mark, our next question is for you. What automation tool does NAIT use?

Mark: We are using Octopai.

Jim: Okay. Next question, does implementing a data lineage tool cause an impact on the database or cause applications and reports to slow down?

Mark: I can speak to that as well because we’re actively using the lineage tool. What we do is we have an automated process that runs once a week that pulls all the metadata and then brings that out to our user community. We have not noticed any performance issues, and we’re pulling a large amount of metadata from a variety of sources.

Jim: Our next question from the audience asks, “How do you govern data lineage when business stakeholders are building their own Power BI or other reports?”

Philip: Yes, I’ll chime in on that one. Earlier, I mentioned rogue data marts, which are a classic problem. We’ve been fighting that one for 30-some-odd years. I’m going to really focus on the data set side of this, not so much the tool side of this. When users create rogue data sets, one of the problems in governing them, or even applying a wide variety of tool functionality to them, is that you don’t know where they put that data set. Right? You really don’t. A really simple thing to do, which we see on the uptick at TDWI, and this registers in our surveys, is people using some equivalent of a data lab or a data sandbox.

The sandbox may be part of the data warehouse environment, it could be part of a data lake environment, whatever. The idea is that users can do all kinds of crazy stuff if they want, but they need to store their rogue data sets in a place where they can be found. That way, it’s a lot easier to govern these things, and it’s a lot easier for IT to step in when they’re asked to improve them and so forth. That’s a simple trick that occurs to me. If anybody else has a better idea, please [chuckles] kick it in here right now.

Jim: Okay. Our last question is for Amnon. Does your product require any professional services to start using it?

Amnon: The long answer is no, and the short answer is no.

[laughter]

Jim: Okay. How long does it take to get up and running?

Amnon: As I explained, one of the most important things that we have in our automation is the fact that the client needs to spend about 30 to 45 minutes extracting metadata locally. Either they do it themselves or they can use certain extractors that we’ve created, free of charge, to extract the metadata. Once the metadata files have been created, supervised or watched by the client, they upload them to their instance in Octopai, and we need one day, 24 hours. Sometimes, with a huge number of systems, we need 48 hours, and the next thing they know, we schedule a training call for them to start working. 45 minutes from the customer, and that’s it. No professional services, no customizations, no interviews, no IT capital whatsoever.

Jim: Thanks. We’re just about out of time here, so let me take a moment to thank our speakers today. We’ve heard from Philip Russom with TDWI, Amnon Drori with Octopai, and Mark Horseman from NAIT. Also, thank you again to Octopai for sponsoring today’s webinar. Please remember that we recorded today’s webinar and we will be emailing you a link to an archived version of the presentation. Feel free to share that with colleagues. Don’t forget, if you’d like a copy of today’s presentation, use the “Click here for a PDF” link. Finally, I want to remind you that TDWI offers a wealth of information, including the latest research, reports, and webinars on business intelligence, data warehousing, and a host of related topics. I encourage you to tap into that expertise at tdwi.org. From all of us here today, let me say thank you very much for attending. This concludes today’s event.

Video Transcript

Jim Powell: Hello, everyone. Welcome to the TDWI Webinar Program. I’m Jim Powell, editorial director at TDWI and I’ll be your moderator. For today’s program, we’re going to talk about why automated data lineage is a must-have for BI and Analytics. Our sponsor is OCTOPAI. For our presentations, today will first hear from Phillip Russom with TDWI. After Phillip speaks, we have a presentation from Mark Horseman from NAIT and Amnon Drori from OCTOPAI. Before I turn over time to our speakers, however, I’d like to go over a few basics. Today’s webinar will be about an hour long.

At the end of their presentations, the speakers will host a question and answer period. If at any time during the presentations you’d like to submit a question, just use the Ask a Question area on your screen to type in your question and send it over. If you have any technical difficulties during the webinar, click on the help area located below the slide window and you’ll receive technical assistance. If you’d like to discuss this webinar on Twitter with fellow attendees, just include the hashtag TDWI in your tweets. Finally, if you’d like a copy of today’s presentation, use the Click here for a PDF line there on the left middle of your console.

In addition, we are recording today’s event and we will be emailing you a link to an archived version so you can view the presentation again later if you choose or share it with a colleague. Okay, again today we’re going to be talking about why automated data lineage is a must-have for BI and analytics. Our first speaker today is Philip Russom, Senior Research Director for Data Management at TDWI. Philip is a well-known figure in data warehousing, integration, and quality, having published over 600 research reports, magazine articles, opinion columns, and speeches over a 20 year period.

Before joining TDWI in 2005, Philip was an Industry Analyst covering the data management industry at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and consultant, was a contributing editor with leading IT magazines, and a product manager at database vendors. His Ph.D. is from Yale. Phillip, I’ll turn things over to you now.

Phillip Russom: Well, thank you, Jim, for the very nice introduction. Also, my special thanks to everybody listening to us today. We know you’re very busy, and we all deeply appreciate that you could find time to join us today. As Jim was explaining, we’re going to talk about data lineage from different viewpoints. Let me get things organized, at least for my part of the webinar. I like to begin a webinar by telling people the takeaways, in other words, ideas that I hope you’ll remember, and take action on when you leave the webinar. For example, today, we’re going to talk about how business users won’t trust and use the data that you provision for them unless they know its origins and transformations.

I’ll eventually explain why that is and what you can do about it. One thing you do about it is that you need data lineage tool functionality, so you have accurate information about data. When you are questioned by users, developers, auditors, governors [silence] and data managers, business managers, et cetera. Data lineage is also useful in other use cases, say for many scenarios in data-driven development, reverse engineering, compliance relative to data, self-service data access, data migrations, and so on. Data lineage addresses all the above plus many other issues, we’ll go through a lot of those for you today.

Also, I would add that there are many compelling business and technical use cases for data lineage. One of the things we’ll highlight is the special role of automation in advanced forms of data lineage. I’ll summarize all this at the end of my presentation as well. Right now, it’s time for a poll, so put your hands on your mouse and get ready to click your response here. In our first poll, the question is, do you agree that enhancing the capabilities of the BI team will increase business users’ trust in their data? If you would, click on one of the four possible answers here, and I’ll give you a moment to–

Also after you click, be sure you click the Submit button. I’ll give you a few seconds to get your answers in there. [singing] Ooh, da doo doo doo doo doo, doo da, da da. Okay, let’s move on. Let’s see what kind of responses we got out of this. Here we go, so most of you agree, the big long blue bar at the top here it said the fact that most of you listening today, 60% of you agree that enhancing the capabilities of the BI team will increase business users’ trust in their data. I think this is true, whether you’re collecting data lineage information specifically or doing other improvements, and more of you agree at various levels.

Yes, I think we’re on the same page here, we all see that there’s a need for these kinds of improvements. Now, I did mention that the users have questions. Well, let’s start there, what are some of these questions? Why would they ask, why do they need answers? We have some common and important questions concerning business intelligence, advanced analytics, data warehousing, and so forth. I got to say, the most common question I’ve ever heard is, where did the data in this report come from? People want to know this before they make an important decision, right? Otherwise, they don’t really know where the data came from.

Every data source has some bias to it, for example, customer data coming from a sales pipeline application is actually quite different from customer data coming out of say, a call center application, right? People want to know, where does the data come from? Also, how did you aggregate and transform it? The mix of data sources will actually have a bias, have an influence on the state of the data as well. People want to know this stuff before they make important decisions about it.

There are the important questions as well, for example, who has used this data? For example, if say, you’re using a self-service data access tool, and you can see where your colleagues have been using certain datasets, that’s actually a positive guidance for you, it can tell you that your colleagues are getting value out of certain data sets, maybe you should look at them yourself, this kind of thing. Who’s used the data is important, a different scenario is data governors can look at who’s using the data and spot inappropriate users and that kind of thing.

Also, people want to know, what’s the quality of this data, and that’s part of a trust value that they’re going to assign to the data. Also, if they have quality problems with the data, this is going to affect downstream work, they’re going to have to do to fix data quality problems. It’s a project planning thing as well. Overall, all the above and other questions go together in determining what’s the level of trustworthiness for the data in question? A solution to help you have really quickly available answers to also have answers that are credible, that make sense to people, then data lineage is an approach to do this.

Data lineage tracks and records the journey that data takes from primary and secondary sources, through a variety of interfaces, and also how it got transformed to repurpose the data for specific use cases and user groups. Data lineage tells you how data was aggregated into target databases. In my world, probably the most common target database for this would be a data warehouse, data marts, but newer stuff like data lakes, and then finally in what delivery mechanisms has this data been used? Things like BI products, reports, analyses, dashboards, and so forth.

Also, data lineage can record other attributes of data. For example, what data domains are these? Subject areas, other categories? What’s the condition of the data’s quality, its metadata, its modeling? What’s the vintage: how old is this stuff, and is it fresh enough for me? It can also list things like what use cases and apps the data got used in. Then finally, we’re seeing a lot of data lineage tools that automatically generate a data map from all the above. That’s a great thing, because you don’t want to put the map together yourself; that could be time-consuming. Automation for that map is a big deal, and we’ll come back to automation a little bit later.

Why do you want to have data lineage? Well, it’s really about the negative ramifications of not having it. If you don’t have credible answers through collected data lineage information, users may not use your products or your data. One of your metrics for success as a data professional is the question: are people actually adopting and using the data sets, the reports, the analyses that we put out there? You could be considered a failure if adoption does not work out among users. Again, if they don’t have the trust, they won’t do the adoption. Another problem is that some users will compensate for the lack of data lineage information by creating their own data. That way, they at least know where the data came from, even if their low-end data sets aren’t the best.

The scariest example of that would be rogue data marts. Data lineage information helps you avoid that rogue behavior. Then finally, without broad data lineage information, a lot of tasks are slower and less accurate. For example, developers will take longer, and developers may select inappropriate data sources. In other areas, say it’s more of a self-service scenario: that kind of user needs all the guidance they can get, and information about data lineage provides very valuable guidance for them. Then finally, I personally have lived through audits. They’re pretty scary. I know from experience, you want the audit to go as quickly and easily as possible. The longer the auditors linger, the more dirt they find.

You really want data lineage and other documentation to help you speed audits, no matter who’s auditing you, whether it’s your own management or a regulatory agency or something like that. Hey, it’s time for another poll, so get your hand on your mouse and get ready to respond. The question this time is: when a business user asks where a particular data item came from, how confident do you feel in your ability to provide an accurate answer in a timely fashion? You can see four different answers. Please click on one, and then click on the Submit button.

I’ll give you a moment to make your choices. [singing] Okay. Let’s look at the responses. Oh, good. If you’re seeing what I’m seeing, there’s a big green bar. Most of you selected, “I’m pretty confident that I can get the answers and do it in a timely fashion.” Almost none of you have no confidence at all, so that’s great. It’s a positive thing that you can get the answers, but part of our point is, if you have the automation provided by data lineage tool functionality, you can do this stuff even faster and with greater accuracy. In a lot of cases, the users don’t even need you as a data professional.

They can quite often go to the lineage tool themselves and get this information in a self-service fashion. Let’s move on to some business use cases. I think data lineage increases the business value from a wide variety of analytics investments, as well as raising adoption rates. Also, data lineage can be a real boost to any kind of program tracking compliance, governance, stewardship, data curation, and compliance with very specific regulations like GDPR, CCPA, and so forth. For a lot of people, data lineage is actually the proof. They can say, “Look, our data lineage tool tracks data usage. Based on that information, we know we have achieved compliance.”

In some cases, you have to prove that you are complying. I mentioned audits earlier, so that is a business problem, and any help you can get to speed an audit is a great thing. Finally, data lineage helps to guide self-service users. From a technology viewpoint, there are also compelling use cases. There’s guidance for developers, and also a productivity boost for developers: they’ve simply got more information, so they don’t have to create and recreate that information themselves, which can be time-consuming. One thing I’ll dwell on here: I don’t think we’ve yet mentioned how, a lot of times, you have to reverse engineer something somebody else built.

Sometimes it’s that rogue data mart I mentioned, and you get stuck as a data professional trying to figure out, “Where does the data in this thing come from, and how do we fix it?” That sort of thing. In other cases, it’s a very sophisticated solution that you inherited from a prior technology colleague. You’ve inherited it, so you have to figure it out. The more information you have about the data, its lineage, its use cases, et cetera, the faster reverse engineering will go. Also, data lineage can help you with continuous improvement, whether that’s a data quality issue, or say finding redundant data sets and merging them, or finding data that’s just not being used and maybe should be taken offline, or at least moved to a cheaper form of storage, and so forth.

Finally, from a technology viewpoint, everything we’re talking about today adds up to a heck of a list, doesn’t it? You need automation instead of having to do a lot of this stuff in a manual and time-consuming way. With that in mind, let’s look directly at automation. With advanced tools for lineage, I see many of them can scan data automatically. When I say scan data, I mean they can reach out to a wide variety of data sources and targets across an entire enterprise. What the tool typically does is create and maintain a broad data map.

This data map can be extremely useful for a wide variety of stuff, not just data lineage. The automatic lineage map saves you a heck of a lot of time as a developer, right? This stuff is automatically created for you; I’ve talked to users who’ve done this in a couple of hours, whereas you could easily put man-days into this kind of thing. The lineage-driven data map gives further automation in other areas. It makes reverse engineering even faster. A tedious part of your job quite often is doing mappings from sources to targets; when you have this map, it makes the drag-and-drop process of doing those a lot better. Having this full information about data also gives you the ability to do impact analysis.

That way you can understand the ramifications of changes before making those changes. Finally, the data lineage map provides a view of all data sets. It gives you a pretty unique big picture, and that’s what the next slide is about. Imagine the big picture of enterprise-scale data as seen through a data lineage map. It’s really showing you an inventory of data to be queried, browsed, searched, and used in a wide variety of ways. The map also visualizes data structures. It helps you understand: what are the bigger architectural patterns going on here? What are the dependencies among independent data sets?
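
As an illustrative aside: with lineage edges like those modeled above, impact analysis reduces to finding every node reachable downstream from the thing you plan to change. Here is a minimal sketch, with hypothetical table and report names:

```python
# Hypothetical impact analysis: find everything downstream of a node in a
# lineage graph, so you can see what a change would touch before making it.
from collections import deque

def downstream_impact(edges: dict, start: str) -> set:
    """Breadth-first walk over `edges` (node -> set of downstream nodes)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Toy lineage: an ETL feeds a fact table that two reports read.
edges = {
    "etl.load_sales": {"dw.fact_sales"},
    "dw.fact_sales":  {"rpt.sales_summary", "rpt.exec_dashboard"},
}
print(downstream_impact(edges, "etl.load_sales"))
# contains dw.fact_sales, rpt.sales_summary, and rpt.exec_dashboard
```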

The data lineage map helps you develop an excellent understanding of very complex data environments. For a lot of our members at TDWI, these complex environments are for data warehousing, data lakes, and analytics programs. There are also operational environments with complex data environments you need to understand as well. The big ones there would be marketing, especially where you’re doing very complex digital marketing, online campaigns, and multichannel marketing. The fully modernized supply chain involves a heck of a lot of data from many sources and in many structures.

Sometimes it’s messages, other times it’s documents like XML, JSON, et cetera. Financial data is equally complex. Furthermore, the data lineage map shows you the current data platforms and also the new ones in one view. You can see the old platforms and the new ones. This is really important if you’ve been doing upgrades and so have a lot of new systems, in particular if you’re doing a lot of cloud migrations or you’re adopting software-as-a-service applications. The ability to see on-premises and cloud data in one place can really give you a much deeper understanding of your current inventory, as well as the architecture and other design patterns laid across them.

In my summary here at the end, let’s talk about why automated data lineage is a must-have for BI and analytics. Remember, that’s what the title of this webinar claims. I think one of the reasons it’s a must-have is that without credible data lineage information, users will not trust and use your data. In response, some of them may go off and create rogue data sets, or use data in non-compliant ways. When you do have broad data lineage information, many tasks are accelerated. Things like data-driven development, like creating new data sets, can go faster with a lot more accuracy, be fully documented, and therefore be easily maintained in the future.

Self-service environments have better information to guide the kind of person that’s doing self-service, audits go faster, and so forth. Data lineage supports compelling business use cases. It gives you BI that’s trusted, used, and accurate. Also, it’s a very good thing to have for compliance, audits, and self-service. Data lineage supports compelling technology use cases: it accelerates development, prototyping, and reverse engineering. Then finally, data lineage is best when it’s automated via that data map I was talking about. You want a map that’s automatically generated, but of course, you might want to manually tweak it.

Some of you may already have some information that could be uploaded to expand and contribute to the map. This data lineage map can help you with things like mapping from sources to targets, doing impact analysis, and just getting the big picture of your enterprise data inventory. All right. Well, Jim, that brings me to the end of my presentation here. Can you segue us to the next speaker, please?

Jim: Thank you, Philip. Just a quick reminder to our audience: if you have a question, you can enter it at any time in the Ask a Question window. We’ll be answering audience questions in the final portion of our program. Our next speaker is Mark Horseman from NAIT, the Northern Alberta Institute of Technology. Mark is an IT professional with over 20 years of experience. Mark moved into data quality, master data management, and data governance early in his career, and has been working extensively in BI since 2005. Mark is currently leading an information management initiative at NAIT. Welcome, Mark.

Mark Horseman: Thank you for that excellent introduction. It’s really exciting to be on the call today and to share our lineage journey with everybody at this TDWI webinar. Thank you for having us. NAIT is a polytechnic in northern Alberta, Canada. Our chief mission is a promise to students, a promise to industry, a promise to our own staff, and a promise to the province. We are essential to Alberta. I’m going to go over how we arrived at lineage being an important aspect of our data management program, and then get into some of the details of the benefits we’ve realized. A short while ago, NAIT brought me over to kick off a data management initiative.

They had some struggles and some issues, and really wanted to bring clarity and trust to the institutional data assets. The first thing that I did when I arrived at NAIT was drink a heck of a lot of coffee, talking with all of the stewards, all of the folks in various business units at our institution (we have about 2,500 staff), making sure we understood the challenges that people were facing out in the wild. That was one of the most critical things that we did early on in beginning our data management journey. It’s important to understand every issue that people face as it relates to consuming data as part of a decision-making process. We need to understand the issues that they have with data.

As Philip was saying, we heard trust, trust, trust. I don’t trust the data. I don’t trust this number. As you can imagine, when you’re making decisions about how many instructors we need and how many students we can teach, trust is an important factor in making sure that we deliver a quality product and achieve our promises to our staff, our students, our province, and our industry partners. With that, we started a catalog of the issues people faced. Where is the trust breaking down? What are people doing out in the wild in the various business units at our institution? Then we built what’s called a common data matrix. This is a data management artifact that we can build out.

That helps us understand the key players of data at our institution: who is responsible for the production, creation, and usage of information, just at a high level, so that we can build out how all of that works. As we build that data management practice out, we need a focus for it. Based on what we were hearing as issues, you can have a lot of different focuses for any kind of data management initiative. People often focus on cyber security, or purely on business intelligence, or on data quality. Given that trust was a huge factor in what we heard, we felt it was absolutely critical that we focused on data quality.

That means the artifacts and the nature of our data management initiative are going to focus on people believing the numbers that they see, people having trust in the accuracy of the information that they’re using in their reports. Once that’s all understood and built out, we need to make sure we align to the business. Any data management initiative that you undertake needs to align to something that the business intensely cares about. As you can imagine, in higher education, financial sustainability is of huge importance right now. Our ability to work very closely with the financial sustainability team at NAIT really gave us the lifeblood that we needed to build that trust and build those relationships.

As we did that, we defined a process where we would put a stamp on a report, a grade on all the BI artifacts that people consumed. That grade was based on: do we have the business rules and definitions understood? Are the reports using the architecture that’s appropriate? Most importantly, data lineage. Where did everything come from? Do we have that documented? Can we confidently say what happened to the data before it showed up on a report? One of the key figures we use in Alberta is called full load equivalent: how much of a full load of classes is a student taking in the context of an academic year?
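
To make that concrete, here is a deliberately naive sketch of what an FLE calculation might look like. The real rules at NAIT (terms, program types, exemptions) are certainly more involved, and the 30-credit full load below is an assumed figure, not NAIT’s:

```python
# A deliberately naive, hypothetical full load equivalent (FLE) calculation.
# Real institutional rules add considerable complexity on top of this.
def full_load_equivalent(credits_taken: float, full_load_credits: float = 30.0) -> float:
    """Fraction of a full annual course load a student is taking."""
    if full_load_credits <= 0:
        raise ValueError("full_load_credits must be positive")
    return credits_taken / full_load_credits

# A student taking 24 of an assumed 30-credit full load is 0.8 FLE for the year.
print(full_load_equivalent(24))  # 0.8
```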

Being able to calculate that number, while it may sound easy, involves a lot of complexity, and being able to explain and show the lineage of it is critical for people to make decisions. Just to dovetail off of what Philip was saying earlier, when you don’t have that trust, and trust was at a low level when I started the journey here at NAIT, you get rogue data sets out there. We had a lot of different pockets within various business units at NAIT that were maintaining their own data sets for decision making because that trust level wasn’t there.

When I was looking at this and seeing this happen, and I’ve done data management for a number of years now in a lot of different vectors, what I really noticed was that this started with good intent: they needed to do something. But then it turned into this big nagging monster that was unmaintainable, whose quality was suspect, with people spending huge amounts of time maintaining this data set just so that they could make the decisions that are critical for the success of their business. What we needed to do was look at why that started and look at bringing them back to a centralized, trusted institutional data source. To bring them to that data source, we needed to build trust, and data lineage is critical to building that trust.

We knew that there would be a lot of ROI in automated lineage. When NAIT brought me in to look at lineage, I was a team of one. [laughs] You can imagine the vast scope of looking at our entire BI environment, our entire analytics environment, and thinking to myself, how on earth is it even possible for me to document the lineage of all these things that people use? How is this even achievable? It’s so huge. We have thousands of reports. We have many enterprise systems. We have a fairly complex data warehouse environment. How is this all going to be even possible? It was immediately apparent that the only way we could achieve this and actually bring value to people was to automate it.

Great, so we’ve automated it and we’re looking at an automated solution. How do we know the business is going to interact with that data? Again, drinking a heck of a lot more coffee. Whew, I got a lot of Starbucks points, that’s for sure. [laughs] We’re out there talking with various consumers; we’ve got various business users that are creating their own data sets or using our data sets, and we can show them, “Hey, this is a solution we’re looking at. This is what it’s going to be able to do for you. How are you going to be able to interact with that data?”

We heard from various stakeholders at our campus that that would be huge for them. They wouldn’t have to spend weeks and weeks trying to figure out where everything came from, or how this happened, or why this number changed; we could have a key resource to help disentangle the complexity of our BI environment. Again, we engaged with the business users and determined that there was a lot of ROI to be had.

Speaking of ROI, I’ve just got a couple of specific cases here, and I might just launch into storytime. [laughs] What we have actually seen since implementing an automated solution is that it’s very helpful for local BI development. It’s a great way for me to have an eagle-eye view of the entire architecture of our system, especially as it changes, and we’ve seen usage outside of our technical team. We’re seeing usage out in business units and we’re seeing usage in our institutional research group. That’s a group that asks hard questions and really interrogates data. There are a number of things that we’ve done with automated lineage. I’d like to talk about these three in particular.

One of the challenges we were faced with was creating an integration to one of our enterprise systems. A number of our integrations where we’re passing customer data from one system to another come through our warehouse environment. This gives us more of a hub and spoke type architecture instead of having one enterprise system ham-fistedly mash data into another enterprise system. It gives us a lot more control, especially over the quality of the data that goes into that. Many, many moons ago, I would guess in 2003, 2004, 2005 somewhere around there, we created an export of our alumni to go into our alumni system.

As you can imagine, as a post-secondary institution, donations from our alumni are very important to us, and being able to understand who has graduated and become NAIT alumni is critical to making that work. It was important for us to look at how the old integration worked to start building a new integration. This had been a black box for many years. People had struggled with managing this. In the old implementation, we had a whole set of data streaming out of one system, going into our warehouse, streaming out of our warehouse, going into an Access database of all places, and then getting mashed up in that Access database to go into another enterprise system at the end.

All of those steps were, whew, oh boy. With automated data lineage, we could see what was happening. We could see the exact data that was going out. We could see where it came from. We saw that in minutes; it took us like half an hour to say, “We can do this. We know where the data’s coming from.” For a task that before would’ve taken us weeks of analysis, we had an answer at our fingertips because it was automatically documented by a lineage solution. That’s one story, working with the new integration. We’ve got a new incoming president, which is very exciting for us, lots of great new ideas, and a new strategic direction for our institution, but part of that requires an ability to trust data.

One of the tasks that we were faced with was the development of a storyboard. What is the story of our student body? How are students interacting with our institution? How does the relationship to the institution change over time? How do they progress? How well are they doing in classes? Do we need to help students ensure that they graduate? Those are all excellent questions which demand an accuracy of data and trust in the source of the data. As we’re developing this storyboard, we’re ensuring that that storyboard is using trusted data sets and we can trust those data sets because they’re automatically documented in our lineage tool.

There is no weird spreadsheet or massive SharePoint site or anything like that, that we have to maintain. It’s all kept in this storyboard and in the automated tool. That has created a lot of trust and a lot of usage of that new storyboard item. The last thing, this is very recent, as you can imagine, as schools are reopening, because we’re a polytechnic and we do a lot of trades education, a lot of our instruction for those types of programs has to be in person. We have welding labs, we have other types of labs, we have health labs, so people learning how to use ventilators. A lot of activity is happening in person on campus.

There are a number of rules that we have related to our relaunch during COVID. One of those is submitting a form every day saying, “I’m going to campus today, and I agree that I don’t have any symptoms of COVID.” There’s a reporting requirement related to that activity. It’s great. We can produce a report, but now we can also produce trust that that report is getting its content from the right source. If somebody questions, “Hey, why isn’t this showing up on the report? I know this person submitted a form,” we can trace back to the source data very easily, and we didn’t have to document the heck out of everything along the way.

Those are my favorite stories. We have many more stories of success that we’ve had with automated lineage. It’s been a huge boon for us. We save anywhere between three and five FTEs from month to month; the sheer ease with which we can get what used to be complex answers across the institution is just huge for us. With that, I’ll pass it over to Amnon.

Amnon Drori: Hi everyone. Thank you very much, Mark, for sharing your story. I think it’s really interesting to see how organizations are choosing to become more sophisticated and more equipped with available technologies in order to deal with data management, specifically around BI. It was really insightful, Philip, for you to share so many things about the need for automation. Listening to both sessions, one thing is very clear: the complexity around data management, the understanding of data in order to hand it over in a trusted way, which is a word that was repeated a couple of times today, to the business user, is something that is really critical.

My name is Amnon Drori. I’m the co-founder and CEO of a company called Octopai. Our story is that in the past five years, we’ve looked at what’s going on around data, the three Vs of data: volume, variety, and velocity. We’ve seen crazy things happening around the need for more data, and a growing demand from business users who need more data constantly. The problem was that we, coming from leading BI groups in insurance, banking, and healthcare, came to a tipping point where we felt that we could not do our jobs properly. We came to the conclusion that adding more and more people would not catch up with the business demand.

I think even what Mark said, saving five FTEs, means in other circumstances that the team is now acting as if it had an additional five FTEs. The need for efficiency, the need to become better, is something we focus our efforts on, and we help BI organizations become better. What I’d like to focus my time on is showing you how this really looks in practice. I’m going to use our demo environment to show you the kind of lineage capability NAIT has been describing.

Before we do that, I want to continue what Philip has been using very nicely, which is a very quick poll. Take a couple of seconds to look at the question. The question, which is really interesting to me (and I’m going to show you results not only from this poll), is: if you need to make a change to a field in the ETL, or in one of the maps, or within the database, how long would it typically take your data team to conduct the impact analysis, just to check what that change to a particular field is going to impact?

Is that going to take you minutes? Is that going to take you hours? Is that going to take you weeks or is that going to take you months? What do you think?

All right. We can move on to the results; I’m going to click on that. I can share with you that you’re a very good audience: the answers range anywhere from hours to a couple of weeks, depending on the need and the use case. In some areas it even takes months, but what’s notable here is that it doesn’t take a few minutes. This is very much aligned with other polls we’ve done in other places, simply because the BI team keeps being surprised. If I were to ask you, what is the use case for which you’ll need lineage next week, or tomorrow, or two weeks from now?

You don’t really know. The need to address business needs happens on the spot. When you need to plan a certain change, for example, because you want to enrich a certain report with additional data and move it from the data sources to the data target, this could be a task that pops up next week. You can never be prepared ahead of time to respond in a timely fashion, and this is where the pressure actually starts. Going back to what we decided to do a couple of years back: we decided to challenge the status quo. We decided to look at how business intelligence landscapes are being managed, which is fairly manual and fairly silo-based.

A lot of professional services involved, a lot of it on-prem, and we said, “What if we could do this differently? What if we think out of the box, do things differently, and see what the benefits are?” One thing that came to mind is: what if we do not need to change or manage each individual business intelligence tool? Whatever you have, whether it’s an ETL, a database, a data warehouse, or reporting and analytics, and, to make it even more spicy, each one of them may be from a different BI vendor. What if I could take all of that? Whether you have one ETL, two databases, and one reporting tool, or, as in some cases we’ve seen, 25, 30, 40, 50 different systems, you want to crunch them into one central repository.

If you can do that, you can also do a very thorough analysis of all that metadata. Once we do, we create a kind of Octopai language that understands the metadata from all of these different vendors’ tools, and we leverage the power of the cloud. The outcome of that analysis is shrink-wrapped into different products, some of which you can see right here. The data lineage is something that is really valuable, but the set of capabilities helps organizations in a lot of use cases, some of which you see here. These are also the typical people who are the direct beneficiaries of automating their BI, specifically around data lineage, to help them with those different sets of capabilities.
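
As an illustrative aside: centralizing metadata from many vendors implies normalizing each vendor’s export into a common record shape before lineage can be stitched across tools. Here is a minimal sketch; the vendor field names below are assumptions for illustration, not the actual export formats of any product:

```python
# Hypothetical normalization of vendor-specific metadata into one common
# record shape, so cross-platform lineage can be stitched together later.
def normalize(vendor: str, raw: dict) -> dict:
    """Map a vendor-specific metadata record to a common schema (illustrative)."""
    if vendor == "ssis":        # assumed field names for an SSIS export
        return {"name": raw["PackageName"], "kind": "etl",
                "reads": raw["Sources"], "writes": raw["Destinations"]}
    if vendor == "cognos":      # assumed field names for a Cognos export
        return {"name": raw["reportName"], "kind": "report",
                "reads": raw["queryTables"], "writes": []}
    raise ValueError(f"no adapter for vendor: {vendor}")

# Two records from different tools land in one uniform repository.
repository = [
    normalize("ssis", {"PackageName": "LoadSales", "Sources": ["stg.sales"],
                       "Destinations": ["dw.fact_sales"]}),
    normalize("cognos", {"reportName": "Sales Summary",
                         "queryTables": ["dw.fact_sales"]}),
]
```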

If I were to say what is very unique, what has changed from five, six, seven years ago, it’s the fact that our approach says: don’t bother with your individual silo tools. Have everything within your platform analyzed, cross-platform; that’s number one. Number two, if you want to enjoy those capabilities, there’s usually a price for it, mainly manual work, customization, understanding the documentation in place, and sharing the knowledge between teams. What we say is: invest 30 to 45 minutes of your time, extract metadata from those tools with a free service from Octopai, and then allow us to analyze it.

If you have 5 tools or 20 tools, it doesn’t matter. Within a day, you’re going to get access to start looking at what’s going on within your BI, whether you use our data lineage, BI catalog, discovery, and so on and so forth. Let’s have a product that, for God’s sake, is going to be very easy to operate. I want to show you a couple of cool things within our product. I’m going to go ahead and share my screen and use the next few minutes to show you how it looks. I’m sharing my screen right now. You should be able to see three round circles that look like this.

What you see here is a centralized repository in Octopai that consists of about 389 ETL processes that are shipping data into 2,500 database tables and views, and there are 24 reports that are consuming the data stored in those 2,500 database tables and views. You can see a bunch of different tools from different vendors, like DataStage and SSIS and SQL Server and Cognos and what have you. Let’s demonstrate some of the things that have been talked about today. Let’s assume your business user is saying, “Something is missing in my report,” or, “Data doesn’t make sense to me,” or, “Why am I looking at this report and it’s half empty? Why am I looking at two different reports with the same name but two different data set results?”

Here’s how you do it. The business question, translated into BI language, means: first of all, I need to find where the report is being generated among the set of different reporting tools. The second thing is how it’s structured, and most importantly, where the data is coming from. Where does the data for that specific report reside within the 2,500 database tables and views, and which ETL processes, exactly and collectively, are responsible for landing the data in that report? That’s the BI language. This is how you do it. You go to the reporting section here; again, this is metadata that has been extracted from all of these systems, centralized, and analyzed.

I’m going to look for a report, let’s say customer products, and as I’m typing, Octopai is searching for that report for me. Again, I’m the BI person. Now, the only thing I need to do is click this Lineage button, which will guide me to the exact tables and views and ETL processes that are related to landing the data on that specific report. Clicking on that, within three seconds, this is the view that you get. What you see here is that this report is actually based on a view, this view is based on data that resides in the tables and views right here colored in red, and the data lands in those database tables and views by running these ETL processes.
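
As an illustrative aside: conceptually, the navigation just described is a backward walk over “is fed by” edges, from the report through views and tables to the ETLs and sources. A minimal sketch with hypothetical names, not Octopai’s actual implementation:

```python
# Hypothetical backward lineage: from a report, walk "is fed by" edges
# until you reach the source tables and the ETLs that load them.
def upstream_lineage(fed_by: dict, report: str) -> set:
    """Return every node the report ultimately depends on (depth-first)."""
    seen, stack = set(), [report]
    while stack:
        node = stack.pop()
        for parent in fed_by.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Toy chain: report <- view <- table <- ETL <- source table
fed_by = {
    "rpt.customer_products":  {"dw.v_customer_products"},
    "dw.v_customer_products": {"dw.dim_customer"},
    "dw.dim_customer":        {"etl.load_customers"},
    "etl.load_customers":     {"src.crm_customers"},
}
print(upstream_lineage(fed_by, "rpt.customer_products"))
```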

First of all, for the first time, you can see different sets of tools and different data processes on one screen. But let’s play around with it. In some cases, this ETL could be the source ETL for that table, but is this really the ETL that captures metadata or data from the applications themselves? Octopai will tell you that the answer is no. Actually, if you continue the navigation, you will find the tables that are sources to that ETL, and if that is the table that is sourced to that ETL, are there any additional ETLs that happen prior to landing the data in that table?

Clicking on that, you can continue the journey as far as you want, up to the source table of the application. You can have everything at once, you can control the flow, and you can do it the other way around. What if you want to see the impact analysis of changing this ETL moving forward? I want to do the lineage forward, not from the report backwards but from the ETL forward. Now you can see a different map. This picture says that any change here may impact not only the report we started this journey from; it will impact all of these reports. And guess what?

This one is an SSRS report and this one, like the sales report, is a Business Objects report. The message is that all your BI is analyzed and in the palm of my hand, so I can understand what’s going on inside. The last thing I want to move on to before I hand it over to Q&A is the last poll on my part, asking you the following question, if you can share: is automation the key to taking your BI operation to the next level? We started this session with this theme, and after this session, it will be really interesting to understand your view on the topic. Take another few seconds and share your thoughts.

Great. I’m going to move forward to the results. As you can see, 35% say “Absolutely” and about 37% say, “Yes, I think so.” The majority is in favor of automation. I would like to invite everybody, if you would like to hear more about the product, the capabilities, the use cases, and why others have chosen to become better by leveraging automation, to feel free to contact Octopai. We’re at your service. We’ll be more than happy to conduct a 15-20 minute demo or even discuss your challenges and see if we’re a fit for your needs. With that said, we’re open to Q&A, I guess.

Jim: Thank you Amnon. Let’s move into our Q&A period now and answer some audience questions. We’ll start with this one for the group. Is it necessary to do data lineage prior to data governance?

Philip: Yes, let me step in on that one, Jim. That’s a great question, and actually, this general question involving data governance has come up quite a bit. Real quick background: at TDWI, we’ve seen a lot of organizations be very successful. [chuckles] In fact, we’ve given some organizations prizes for creating a data governance program before other data management programs, especially those that involve lots of change, because change has to be managed, and governance support is good for that. For example, you typically want to create a governance board before you attempt a large data quality program or a master data management program.

You know, to be honest, I’m not sure about the order when it comes to governance and data lineage, but I would guess that any time you begin a new kind of data management practice, and lineage will be new to many of you, that it does help to have governance in the beginning. Then my final thought here is all data has to be governed, period. All new data management practices need to be involved with governance, period. Listen, Mark or Amnon, do you have any thoughts to add to this?

Mark: You bet–

Amnon: If I can add– Sorry. Go ahead, Mark.

Mark: Thanks. Yes, really to echo what Philip said, for us at NAIT, understanding our data governance group and building that council of information stewards, as we called them, really was a benefit in understanding what the ROI would be before implementing lineage. Lineage is a component of a data management program or a data governance program, and having those two things work together is really critical. So yes, understand who your stewards are; having that council available to help launch your program and understand your ROI makes a lot of sense. Amnon?

Amnon: Right. Yes, I can share that it very much depends on the use case. I agree with you completely that data lineage is an important part of data management. Nevertheless, we have clients that either did not implement data governance, or are on the journey of considering it, while their BI needs to be, or wants to be, sophisticated tomorrow. That’s the power of this automation. If you want to deal with those use cases, you can start automating your data lineage while you consider or explore data governance, and then it will be incorporated with that.

Mark: We don’t have our cameras on so you can’t see me nodding profusely in agreement with that.

Amnon: Yes.

Jim: Mark, our next question is for you. What automation tool does NAIT use?

Mark: We are using Octopai.

Jim: Okay. Next question, does implementing a data lineage tool cause an impact on the database or cause applications and reports to slow down?

Mark: I can speak to that as well because we’re actively using the lineage tool. What we do is we have an automated process that runs once a week that pulls all the metadata and then brings that out to our user community. We have not noticed any performance issues, and we’re pulling a large amount of metadata from a variety of sources.

Jim: Our next question from the audience asks, “How do you govern data lineage when business stakeholders are building their own Power BI or other reports?”

Philip: Yes, I’ll chime in on that one. Earlier, I mentioned rogue data marts, which is a classic problem. We’ve been fighting that one for 30-some-odd years. I’m going to really focus on the data set side of this, not so much the tool side. When users create rogue datasets, one of the problems in governing them, or even applying a wide variety of tool functionality to them, is that you don’t know where they put that data set. Right? You really don’t. A really simple thing to do, which we see on the uptick at TDWI, and this registers in our surveys, is people using some equivalent of a data lab or a data sandbox.

The sandbox may be part of the data warehouse environment, it could be part of a data lake environment, whatever. The idea is that users can do all kinds of crazy stuff if they want, but they need to store their rogue data sets in a place where they can be found. That way, it’s a lot easier to govern these things, and it’s a lot easier for IT to step in when they’re asked to improve them and so forth. That’s a simple trick that occurs to me. If anybody else has a better idea, please [chuckles] kick it in here right now.

Jim: Okay. Our last question is for Amnon. Does your product require any professional services to start using it?

Amnon: The long answer is no, and the short answer is no.

[laughter]

Jim: Okay. How long does it take to get up and running?

Amnon: As I explained, one of the most important things in our automation is that the client needs to spend about 30 to 45 minutes extracting metadata locally. Either they do it themselves or they can use certain extractors that we’ve created, free of charge, to extract the metadata. Once the metadata files have been created, supervised or watched by the client, they upload them to their instance in Octopai, and we need one day, 24 hours. Sometimes, with a huge number of systems, we need 48 hours, and the next thing they know, we schedule a training call for them to start working. 45 minutes from the customer and that’s it. No professional services, no customizations, no interviews, no IT capital whatsoever.
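
As an illustrative aside: for a relational source, “extracting metadata locally” can be as simple as dumping the standard INFORMATION_SCHEMA views to a file. This is a generic sketch, not Octopai’s actual extractor; it assumes the third-party pyodbc package and a configured ODBC driver:

```python
# Illustrative metadata extraction: dump table/column metadata from the
# standard INFORMATION_SCHEMA views to a JSON file for later upload.
# (A generic sketch only; not any vendor's actual extractor.)
import json
import pyodbc  # assumes an ODBC driver and connection details are available

def extract_metadata(conn_str: str, out_path: str) -> None:
    with pyodbc.connect(conn_str) as conn:
        rows = conn.cursor().execute(
            "SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE "
            "FROM INFORMATION_SCHEMA.COLUMNS"
        ).fetchall()
    records = [dict(zip(("schema", "table", "column", "type"), r)) for r in rows]
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)

# Hypothetical usage with an assumed DSN:
# extract_metadata("DSN=warehouse;UID=reader;PWD=secret", "metadata.json")
```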

Jim: Thanks. We’re just about out of time here so let me take a moment to thank our speakers today. We’ve heard from Philip Russom with TDWI, Amnon Drori with Octopai, and Mark Horseman from NAIT. Also, thank you again to Octopai for sponsoring today’s webinar. Please remember that we recorded today’s webinar and we will be emailing you a link to an archived version of the presentation. Feel free to share that with colleagues. Don’t forget, if you’d like a copy of today’s presentation, use the “Click here for a PDF” line. Finally, I want to remind you that TDWI offers a wealth of information including the latest research, reports, webinars, and information on business intelligence, data warehousing, and a host of related topics. I encourage you to tap into that expertise at tdwi.org. From all of us here today, let me say thank you very much for attending. This concludes today’s event.
