What is a Data Catalog?
A data catalog is a marketplace that organizes all the data assets in a company’s information landscape. Each data asset’s entry in the data catalog includes definitions, descriptions, ratings, data owner and steward, and more, making it simple to search for and identify the data you need for any given purpose.
Practical Uses and Benefits of a Data Catalog
Your company probably has multiple types of data consumers who will get their own type of benefit from the data catalog. Let’s take a look at a few of them:
Self-service BI users
Your typical self-service BI user is the interface between IT and business: not quite as techy as IT, but BI-savvy enough to put together their own reports and analytics (often to be consumed by business users who don’t self-serve).
Common self-service uses of a data catalog include:
Data discovery and evaluation
Self-service users need to know what data is out there so they can build reports. To build an effective sales-focused report, for example, they need to know what data assets your company has that relate to sales. With an enterprise data catalog, your self-service user can use the search functionality to search for terms relating to sales. The catalog information about each data asset gives them the overview they need to see whether this data will serve their purposes.
Without a data catalog, it is much harder, if not impossible, to track down all the potentially relevant data assets. Plus, in order to evaluate the data asset’s suitability, the user needs to actually access and look directly at the data and/or spend time contacting and questioning IT/BI staff about the data. The process is about as effective as ordering a shirt from Amazon just to find out what the color is. You ordered, you waited, you opened the package… OH, it’s chartreuse. Hmm… I guess I’ll need to pack it up and ship it back and order another mystery shirt. Maybe that one will be a color I like.
With a data catalog, it’s all right there. You can know what color the shirt is before you order it. Novel, right?
Clarification and documentation
No matter how good your catalog is, sometimes your self-service BI user needs to touch base with subject matter experts in order to answer a question he has about the data.
If this conversation can take place within the framework of the data catalog, the question will need to be asked once – and only once. No user will ever need to ask it again because it will be documented within the notes for that dataset and available for every user to see.
It will also be obvious whom to ask, i.e. the owner/subject matter expert for that dataset should be defined right there in the catalog entry.
And even if the subject matter expert for this dataset left the company a month ago and has gone off the grid, the invaluable historical knowledge she possessed about this asset has not vanished with her. The data catalog preserves tribal knowledge even among companies with high turnover.
Without this centralized communication, questions get asked again and again by different users (or by the same one, who forgot the answer from the last time he asked it). Additionally, precious time is inevitably wasted trying to figure out who on the IT side actually knows the answer to the question. And in an industry with high turnover, it will be a toss-up as to whether there IS anyone on the IT side who can answer this question.
Business users
A data catalog holds just as much value for business users as it does for the average self-service BI user. It gives business users true independence in creating business value from the data that BI teams work so hard to provide. Business users most commonly use the data catalog for:
Finding reports
BI landscapes are growing more complex, containing multiple tools and often even multiple reporting tools. With an enterprise data catalog, a business user can easily locate all reports across all tools that have to do with sales, say, or that relate to customer retention.
Understanding reports
At the risk of stating the obvious: in order to derive benefit from a report, you have to know what it’s talking about – and what it’s not talking about. A business user with access to a data catalog can easily check which datasets were used in a report and what details about the data the report includes, and can understand the scope of the data and any constraints on the dataset or on the report itself. They can also see who the right people are to collaborate with on the report, and then collaborate directly in the catalog. Once again, that preserves tribal knowledge, along with its context, and keeps things transparent for all data citizens.
Avoiding duplicate reports
When your business user can easily see what reports have already been created and are available for use, he doesn’t have to waste time reinventing the wheel – or asking IT or self-service BI users to reinvent it for him.
Inquiring minds
No matter their department or designation, any data user can leverage a data catalog to get the answers to questions such as:
- Where should I look for this data?
- What does this data represent?
- Is this data relevant and important?
- Where is this data coming from?
- Who is responsible for this data?
- How can I use this data?
- Who else is using this data?
By implementing a data catalog, you empower your data citizens with independence in using data, preserve tribal knowledge about the data and help your organization with data-driven initiatives.
If you’re now all revved up for your very own data catalog, how do you go about building one?
How to Build a Data Catalog
So, you could tell a junior staff member to go through every single data asset you have and write down all of the above specs.
Warning: only do that to an employee you want to leave the company as soon as possible.
If you value your time (and your junior staff members), here’s how to really build a data catalog:
1. Identify your data assets – and which metadata you want to record for each data asset
Before you start gathering all the data assets and information from their metadata, plan it out. What are you going to include in your catalog? Metadata relevant to a catalog includes:
- Technical description
- Business description
- Technical info (e.g. the ETL process or logic performed on this dataset)
- Asset type
- Responsible parties (e.g. owner, steward, SME)
- When was it last updated
- Topic tags (e.g. compliance, regulation, line of business, related projects)
- Rating option
- Approved for use?
- Sensitive?
- What other assets is this data associated with?
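To make that list concrete, here is a minimal sketch of what a single catalog entry could look like as a Python record. Every field and class name below is our own illustrative assumption, not the schema of any particular catalog product; adapt it to whatever metadata you decided to record.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CatalogEntry:
    """One data asset's record. All field names are illustrative assumptions."""
    name: str
    asset_type: str                          # e.g. "table", "report", "ETL job"
    technical_description: str = ""
    business_description: str = ""
    technical_info: str = ""                 # e.g. ETL logic performed on the asset
    owner: Optional[str] = None
    steward: Optional[str] = None
    sme: Optional[str] = None
    last_updated: Optional[date] = None
    tags: List[str] = field(default_factory=list)
    ratings: List[int] = field(default_factory=list)   # e.g. 1 to 5 stars
    approved_for_use: bool = False
    sensitive: bool = False
    related_assets: List[str] = field(default_factory=list)

    def average_rating(self) -> Optional[float]:
        """Mean user rating, or None when nobody has rated the asset yet."""
        if not self.ratings:
            return None
        return sum(self.ratings) / len(self.ratings)


# Hypothetical example asset.
entry = CatalogEntry(
    name="sales_orders",
    asset_type="table",
    business_description="One row per confirmed customer order.",
    owner="jane.doe",
    tags=["sales", "line of business"],
    ratings=[4, 5],
    approved_for_use=True,
)
print(entry.average_rating())
```

Defaults are deliberately empty or conservative (not approved, not sensitive, no owner), so an entry that automation has harvested but no human has reviewed is easy to spot.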
The average BI environment has over 100K assets. With all that metadata for each asset… do the math – did we mention the metadata keeps updating constantly?
Can you appreciate why a human being might go off the deep end if he had to do this manually?
Let’s take a moment to thank the heavens for automation.
Had a moment? Let’s go on.
2. Set up the data catalog framework
You could program your own framework, or you could go with an existing tool. Octopai has done the legwork (all 8 of them) on this one.
Whether you are building your own framework or evaluating an existing data catalog, keep these important functionalities in mind:
Searchability and filtration
Powerful search and filtration capabilities are essential to making your data catalog usable. The search function is how your self-service BI and business users will access your catalog assets (tens of thousands of them, at the minimum!). The search filters are how they will pinpoint the assets they are interested in.
Efficient search and filtration are also important for the team members responsible for managing or maintaining assets.
Make your search functionality intuitive. The more your data catalog’s search and filter functions resemble the search and filter experience of an online marketplace, the better for your users, especially the less technical ones.
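If you are building your own framework, the core of marketplace-style search can be sketched in a few lines: one free-text box plus optional filters. This is a toy in-memory version under our own assumptions (the `CATALOG` entries, field names, and `search_catalog` function are all hypothetical); a real catalog would sit on a proper search index.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory catalog; the entries and field names are assumptions.
CATALOG: List[Dict] = [
    {"name": "sales_orders", "type": "table",
     "description": "Confirmed customer orders",
     "tags": ["sales"], "sensitive": False},
    {"name": "Monthly Sales Report", "type": "report",
     "description": "Revenue by region",
     "tags": ["sales", "finance"], "sensitive": False},
    {"name": "customer_pii", "type": "table",
     "description": "Customer contact details",
     "tags": ["customer"], "sensitive": True},
]


def search_catalog(entries: List[Dict], term: str,
                   asset_type: Optional[str] = None,
                   include_sensitive: bool = True) -> List[Dict]:
    """Case-insensitive substring search over name, description, and tags,
    narrowed by optional asset-type and sensitivity filters."""
    needle = term.lower()
    results = []
    for e in entries:
        haystack = " ".join([e["name"], e["description"], *e["tags"]]).lower()
        if needle not in haystack:
            continue
        if asset_type is not None and e["type"] != asset_type:
            continue
        if not include_sensitive and e["sensitive"]:
            continue
        results.append(e)
    return results


sales_assets = search_catalog(CATALOG, "sales")
sales_reports = search_catalog(CATALOG, "sales", asset_type="report")
print([e["name"] for e in sales_assets])
print([e["name"] for e in sales_reports])
```

The search term casts a wide net and the filters narrow it down, which is the same two-step experience your users already know from online shopping.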
Data Asset Evaluation
A user should be able to get a good idea of whether a given data asset will meet their needs or not from the information available in your data catalog. User ratings, managed annotations, tags, sensitivity of data, and responsibilities are all important components of effective evaluation.
Automation
We’ll get to this in number three below because really, it deserves its own number.
3. Use an automated catalog solution to survey your entire BI landscape and pull out the existing metadata for each data asset
That’s enough planning. Now it’s time for action!
The next step in data catalog creation is harvesting the metadata of your assets from across the different tools in your BI environment, including ETLs, databases, and reporting tools. An automated data catalog solution is the tool of choice for this. The metadata is automatically extracted from your BI systems, analyzed, and then without any effort on your part, the data catalog is populated with all of the existing information about the data assets (emphasis on existing).
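To give a feel for what "harvesting existing metadata" means, here is a small sketch that reads table and column information out of a SQLite database using Python's standard `sqlite3` module. It is an illustration only: SQLite stands in for your real systems, the `harvest_sqlite_metadata` function and the entry layout are our own assumptions, and a full catalog solution would do this across ETLs, databases, and reporting tools.

```python
import sqlite3
from typing import Dict, List


def harvest_sqlite_metadata(conn: sqlite3.Connection) -> List[Dict]:
    """Build one draft catalog entry per table or view from SQLite's own metadata."""
    entries = []
    objects = conn.execute(
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    for name, kind in objects:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        entries.append({
            "name": name,
            "asset_type": kind,
            "columns": [{"name": c[1], "type": c[2]} for c in columns],
            # Only existing metadata is harvested; a human fills this in later.
            "business_description": "",
        })
    return entries


# Hypothetical demo database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE VIEW big_orders AS SELECT * FROM sales_orders WHERE amount > 1000")

harvested = harvest_sqlite_metadata(conn)
for item in harvested:
    print(item["asset_type"], item["name"], [c["name"] for c in item["columns"]])
```

Notice that `business_description` comes back empty. The harvest can only report what the systems already know, which is exactly the "emphasis on existing" caveat.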
4. Enrich the automated documentation
All done?
We wish, but automation isn’t *quite* there yet. You will most likely need a real, live human to review your data catalog assets and enrich the definitions of the assets with relevant information to help data consumers be independent in using the data.
Obviously, this doesn’t happen for 100K assets all at once. How can you make it happen without overwhelming your employees on the one hand, or neglecting the job on the other?
Three strategies we’ve seen work for successful data catalog onboarding leverage the following:
New projects – Make it a standard part of a new project to document its data assets in the catalog.
Change requests – When closing a change request, make sure the data catalog is updated for the specific asset that the change request addressed.
Data Owners and Stewards – Assign SMEs to assets and make it their responsibility to update the metadata of those assets.
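Two of these strategies lend themselves to a little tooling. The sketch below, written under our own assumptions (the `assets` records, field names, and both functions are hypothetical), groups assets that still lack a business description by steward, and records a note on an asset when its change request is closed.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical catalog records.
assets: List[Dict] = [
    {"name": "sales_orders", "steward": "dana", "business_description": ""},
    {"name": "crm_orders", "steward": "dana",
     "business_description": "Orders captured in the CRM."},
    {"name": "legacy_extract", "steward": None, "business_description": ""},
]


def enrichment_backlog(entries: List[Dict]) -> Dict[str, List[str]]:
    """Group assets that still lack a business description by their steward."""
    backlog = defaultdict(list)
    for e in entries:
        if not e.get("business_description"):
            backlog[e.get("steward") or "unassigned"].append(e["name"])
    return dict(backlog)


def close_change_request(entries: List[Dict], asset_name: str, note: str) -> bool:
    """When a change request closes, document the change on the asset it touched.
    Returns False if the asset is not in the catalog."""
    for e in entries:
        if e["name"] == asset_name:
            e.setdefault("notes", []).append(note)
            return True
    return False


backlog = enrichment_backlog(assets)
updated = close_change_request(assets, "sales_orders", "Added currency column")
print(backlog)
```

Small, per-steward backlogs like this keep the enrichment work in bite-sized pieces instead of one 100K-asset mountain.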
5. Link a data lineage tool to your data catalog framework
Let’s say you’re a data steward responsible for a data asset, and a business user has reached out to you questioning the data’s accuracy. Since you’re responsible for the technical aspects, it is crucial to gain visibility into what happened to the data as it flowed through the BI landscape. The most efficient, accurate way to do so is with a data lineage tool, preferably an automated one. By integrating a catalog with data lineage, you gain one-click access to the end-to-end lineage. That shortens the time it takes to trace what happened to the data and lets you give the business quick, accurate answers, which builds trust in the data.
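Under the hood, "tracing what happened to the data" is a walk through a graph of assets. Here is a minimal sketch, assuming lineage is available as a simple mapping from each asset to its direct upstream sources; the `LINEAGE` data and `upstream_assets` function are hypothetical, and a real lineage tool would extract this mapping automatically.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: each asset maps to the assets it reads from directly.
LINEAGE: Dict[str, List[str]] = {
    "Monthly Sales Report": ["sales_mart"],
    "sales_mart": ["etl_load_sales"],
    "etl_load_sales": ["crm_orders", "erp_invoices"],
    "crm_orders": [],
    "erp_invoices": [],
}


def upstream_assets(lineage: Dict[str, List[str]], asset: str) -> List[str]:
    """Breadth-first walk from an asset back to every source that feeds it,
    nearest sources first. The seen-set guards against cycles."""
    seen = {asset}
    ordered = []
    queue = deque(lineage.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        ordered.append(current)
        queue.extend(lineage.get(current, []))
    return ordered


trace = upstream_assets(LINEAGE, "Monthly Sales Report")
print(trace)
```

Starting from the questioned report, the steward gets the mart, the ETL job, and the source systems in order, which is the checklist for finding where a number went wrong.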
6. Ongoing maintenance and upkeep of your data catalog
Congratulations! You have a data catalog. Now you can release it into the wild and forget all about it…
Except data catalogs aren’t wild beasts; they’re domesticated animals. They need data catalog management, maintenance, upkeep, and care. When a data catalog is automated, the data assets are automatically updated according to the latest metadata, keeping the catalog relevant, accurate and up-to-date.
A health dashboard within the data catalog can be a major help in keeping abreast of the data catalog’s level of completion and areas that need more attention.
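The simplest health metric is completeness: what share of assets have each important field filled in? The sketch below computes that under our own assumptions (the sample entries, the `REQUIRED_FIELDS` choice, and the `catalog_health` function are all hypothetical); a real dashboard would track more than this.

```python
from typing import Dict, List, Sequence

# Which fields count toward "complete" is an assumption; pick your own.
REQUIRED_FIELDS = ("business_description", "owner", "tags")

# Hypothetical catalog snapshot.
snapshot: List[Dict] = [
    {"name": "sales_orders", "business_description": "Confirmed orders",
     "owner": "jane", "tags": ["sales"]},
    {"name": "sales_mart", "business_description": "Sales star schema",
     "owner": "jane", "tags": []},
    {"name": "crm_orders", "business_description": "", "owner": "omar", "tags": []},
    {"name": "legacy_extract", "business_description": "", "owner": None, "tags": []},
]


def catalog_health(entries: List[Dict],
                   required: Sequence[str] = REQUIRED_FIELDS) -> Dict[str, float]:
    """Percentage of entries with a non-empty value for each required field."""
    total = len(entries)
    report = {}
    for field_name in required:
        filled = sum(1 for e in entries if e.get(field_name))
        report[field_name] = round(100 * filled / total, 1) if total else 0.0
    return report


health = catalog_health(snapshot)
print(health)
```

A low percentage on one field tells you where to point your stewards next.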
Data, here we come!
A data catalog is the ultimate window shopping experience for data. All your data-driven employees, from business to self-service BI users to management, can easily get an idea of what data is out there, what it represents, what it can be used for, and how it should be used.
Unlike a consumer catalog or an online shopping site, no one we know spends their free time browsing through a data catalog fantasizing about all the data they could have… but maybe they’re just too embarrassed to admit it.