This is an entry from a data catalog pre-automation:
Source: The British Museum’s Teaching History with 100 Objects
This Sumerian data catalog entry (c. 2900 B.C.E.) relates to a dataset of barley rations paid to workers. The entry features the data asset description (i.e., the stalk-of-barley symbol and the circular numeral signs) and the data owner (i.e., the sign of the official responsible for the transaction).
This data catalog didn’t need automation. It was perfectly reasonable for an individual to manually manage a Sumerian data collection (especially if you paid him enough barley).
Jump forward about 5000 years.
Data has gotten a whole lot more complicated.
Why do we need a data catalog, again?
The more complex your data environment is, the more you need a system of organization and management.
A data catalog is a tool that organizes all the data assets in a company’s data landscape. For each asset, it records definitions, descriptions, ratings, responsible individuals, and more, making it simple to search for and identify the data you need for any given purpose.
With a data catalog, you can see what data you have and find what data you need. You can have a single source of truth that spans source systems and silos.
But before you can reap, you must toil. Putting together a traditional data catalog isn’t simple.
What’s involved in building a data catalog?
Data catalogs are built on metadata, the data about your data. All that metadata is used to populate your data catalog (which some also consider a metadata catalog).
How do you collect the relevant metadata and get it where it needs to go?
Well, you could ask some subject matter experts or professional services to manually survey your entire BI landscape, organize everything in spreadsheets, eliminate duplicates and resolve conflicting metadata, and then use the results to construct your data catalog.
But with the average company as the proud owner of over 100K data assets, that’s going to cost you a whole lot of barley.
Not only do you need a way to organize your data assets (i.e. the catalog), but you need a way to organize the construction of your data catalog.
Enter automation.
You can harness the power of automation to tackle one or more of the following data catalog tasks:
Harvesting data assets with automation
Get the combine warmed up. It’s time to go harvesting.
The metadata of your assets is spread across the different tools in your BI environment, including ETL tools, databases, and analysis and reporting tools. Often these tools are siloed, creating a broken telephone situation when it comes to your metadata. Your company can have a considerable amount of documentation about each data asset, but if that documentation is siloed, any employee who goes looking will only see part of the picture.
An automated data catalog solution breaks down the barriers between data silos, automatically gathering metadata from across your entire BI landscape and integrating it into a coherent, unified picture that can be used by both Business and IT departments.
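To make the harvesting idea concrete, here is a minimal Python sketch of merging metadata from several siloed tools into one entry per asset. Everything in it is an assumption made for illustration: the asset names, the field names, the three source lists, and the "first value wins" rule are hypothetical and do not describe how any particular catalog product works.

```python
from collections import defaultdict

# Hypothetical metadata records, as they might be exported from three siloed tools.
etl_records = [{"asset": "sales.orders", "description": "Orders loaded nightly"}]
db_records = [{"asset": "sales.orders", "owner": "j.smith", "columns": 12}]
report_records = [
    {"asset": "sales.orders", "used_in": "Monthly revenue report"},
    {"asset": "hr.headcount", "owner": "a.jones"},
]


def harvest(sources):
    """Merge per-tool records into a single catalog entry per asset."""
    catalog = defaultdict(lambda: {"sources": []})
    for source_name, records in sources.items():
        for record in records:
            entry = catalog[record["asset"]]
            entry["sources"].append(source_name)
            for field_name, value in record.items():
                if field_name != "asset":
                    # Assumed policy: the first value seen wins.
                    # A real tool would need a proper conflict-resolution rule.
                    entry.setdefault(field_name, value)
    return dict(catalog)


catalog = harvest(
    {"etl": etl_records, "database": db_records, "reporting": report_records}
)
print(catalog["sales.orders"])
```

The point of the sketch is the shape of the result: one entry for sales.orders that combines what the ETL tool, the database, and the reporting tool each knew separately.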
Keeping the data catalog up to date
You have a data catalog – hooray! Time to sit back and relax!
Not so fast. Metadata can change. Data assets are added, removed, and updated frequently. Without automation, you would need to manually update the data catalog every time there is a change in information relating to any of your 100K data assets.
If you feel a headache coming on, take heart: automation has this covered. An automated data catalog platform will periodically recheck the metadata of all the data assets throughout your BI landscape and update your data catalog accordingly.
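As a rough illustration of what such a recheck could involve, the Python sketch below compares the catalog with a freshly harvested snapshot and applies the differences. The function names, the dictionary layout, and the sample assets are assumptions invented for this example; scheduling the recheck (for instance, running it nightly) is left out.

```python
def diff_metadata(catalog, latest):
    """Compare the current catalog with a fresh snapshot of the metadata."""
    added = {name: latest[name] for name in latest.keys() - catalog.keys()}
    removed = sorted(catalog.keys() - latest.keys())
    changed = {
        name: latest[name]
        for name in catalog.keys() & latest.keys()
        if catalog[name] != latest[name]
    }
    return added, removed, changed


def refresh(catalog, latest):
    """Return an updated catalog without modifying the original one."""
    added, removed, changed = diff_metadata(catalog, latest)
    updated = {name: meta for name, meta in catalog.items() if name not in removed}
    updated.update(added)
    updated.update(changed)
    return updated


# Hypothetical data: one asset changed owner, one was dropped, one is new.
current = {
    "sales.orders": {"owner": "j.smith"},
    "legacy.export": {"owner": "retired"},
}
snapshot = {
    "sales.orders": {"owner": "m.lee"},
    "hr.headcount": {"owner": "a.jones"},
}

added, removed, changed = diff_metadata(current, snapshot)
print("added:", added, "removed:", removed, "changed:", changed)
```

Reporting the three kinds of change separately, rather than simply overwriting the catalog, leaves room to notify data owners or log what happened between runs.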
Value beyond data catalog automation
A dedicated data catalog solution brings many benefits that go beyond automation:
- A centralized place where all data citizens have access to the data assets and their definitions
- Easy-to-use interface with no limitation on the number of simultaneous users
- Preservation of tribal knowledge
- In-context collaboration about the data assets
- A single source of truth about your organization’s data
What to look for in a data catalog automation solution
The baseline of any solution you’d consider should be the ability to automate data catalog creation and updates. (If it can’t do that, it’s not worthy of the name “data catalog automation.”)
After you’ve established that a data catalog offering has that baseline, look at how comprehensive its data asset entries are. Does it only list the definition and description of the asset? Can you preview and/or access the data set from within the entry? Does it have role-based information, like data owners, stewards and subject matter experts? Does it also support user-generated content, like ratings and reviews? Are there communication and collaboration tools built into the catalog?
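One way to picture a comprehensive entry is as a record with a slot for each kind of information those questions ask about. The Python sketch below is only an illustration under that assumption: the CatalogEntry class, its field names, and the sample values are hypothetical, not the schema of any real catalog product.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class CatalogEntry:
    """Hypothetical catalog entry covering the kinds of content listed above."""

    name: str
    definition: str = ""
    description: str = ""
    owner: str = ""
    stewards: List[str] = field(default_factory=list)
    subject_matter_experts: List[str] = field(default_factory=list)
    ratings: List[int] = field(default_factory=list)
    reviews: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """List the fields that are still empty for this entry."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def average_rating(self) -> Optional[float]:
        """Average of user ratings, or None if nobody has rated the asset."""
        return sum(self.ratings) / len(self.ratings) if self.ratings else None


entry = CatalogEntry(
    name="sales.orders",
    definition="One row per customer order",
    owner="j.smith",
    ratings=[4, 5],
)
print(entry.missing_fields())
print(entry.average_rating())
```

A check like missing_fields gives a quick sense of how complete an entry is, which is the same thing you would be judging by eye when comparing catalog offerings.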
Even if the catalog entries are extremely comprehensive, their value can only be realized if the catalog interface is user-friendly. An automated data catalog that you actually want users (especially non-technical users) to leverage should be as intuitive to use as an online marketplace.
A whole host of other factors could come into play when you’re selecting data catalog software, so weigh them according to the particular needs of your organization.
On your mark, get set, automate!
Setting up a data catalog is a big, labor-intensive job. But it gets far easier when you harness the power of automation.
So get an automated data catalog solution on the job, and then make like a Sumerian worker and kick back with a mug of fermented barley.
Beer o’clock. Yeah.