If you’re a mystery lover, I’m sure you’ve read that classic tale: Sherlock Holmes and the Case of the Deceptive Data, and you know how a metadata catalog was a key plot element.
No?
Well, I’ll try not to give too many spoilers.
In The Case of the Deceptive Data, Holmes is approached by B.I. Guy after his quarterly report to management is denounced as inaccurate and misleading. The honest, hardworking Guy is astounded. He is losing sleep over the fear that he’s been framed.
Holmes, after taking a long puff of his pipe, agrees to take on the case. After an initial investigation, he calls Guy back to his office on Baker Street. “I must reassure you,” Holmes says. “It is, of course, possible that you have an enemy – and that we shall find out. But it is eminently possible that you were exposed to inaccurate data through no human fault.”
He goes on to explain:
Reasons for inaccurate data
Integration of external data with complex structures
Big data is BIG. Enterprises often use data sources originating outside their organization, including data sets from the internet, the IoT, industrial sources, and scientific sources. Some of these data assets are structured and easy to integrate. Many others are rich, unstructured data sources like documents and videos. The less structure a source has, the trickier it is to fit in, build with, understand, and leverage.
Multiple, uncoordinated data systems
Even when the data originates from within your enterprise environment, it’s all too easy for things to get confusing. Siloed data or data processes that are “put back together at the end” are a surefire way to get data conflicts or inaccuracies.
When your CRM talks about “conversions” and your analytics suite talks about “conversions,” are they speaking the same language? Maybe they have different definitions of conversions, which would certainly lead to metrics that don’t match up. Or maybe they share the same definition, but they each calculate it differently. Same end result: a numerical Tower of Babel.
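To make the mismatch concrete, here is a minimal sketch in Python. The event records and both counting rules are hypothetical and are not taken from any particular CRM or analytics product: one rule counts every purchase event, the other counts each customer at most once.

```python
# Hypothetical purchase events seen by both systems (illustrative data only).
events = [
    {"customer": "alice", "action": "purchase"},
    {"customer": "alice", "action": "purchase"},
    {"customer": "bob", "action": "purchase"},
    {"customer": "carol", "action": "signup"},
]


def conversions_by_event(records):
    """Assumed analytics-suite rule: every purchase event is a conversion."""
    return sum(1 for r in records if r["action"] == "purchase")


def conversions_by_customer(records):
    """Assumed CRM rule: a customer converts at most once."""
    return len({r["customer"] for r in records if r["action"] == "purchase"})


analytics_count = conversions_by_event(events)
crm_count = conversions_by_customer(events)
print(analytics_count, crm_count)
```

Both functions read the same events and both report "conversions," yet they return different numbers. Neither is wrong; they simply answer different questions under the same name.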
Asymmetrical data
Another issue with multiple data systems is that even if everyone starts out on the same page, it’s not simple to keep everyone on the same page. A change to one system is not always reflected across the board, leaving one or more systems with outdated data or definitions.
Inaccurate source data
Sometimes data is just wrong. Someone manually typed it in wrong. Someone made a mistake when copying or integrating it. Someone accidentally mapped a field to column G when it should have been mapped to column H.
The light on the horizon
The furrow in B.I. Guy’s brow uncreases ever so slightly. “Do you really think it’s one of those?” he asks hopefully.
“We shall know soon,” promises Holmes.
Within a week, Guy is sitting in an armchair at 221B Baker Street, looking much more relaxed. Holmes has told him that he has identified the source of the deceptive data as being benign. It may have been a product of human error, but it was certainly not a product of human intent.
“But how can I prevent this from happening in the future?” persists Guy.
“It’s elementary, my dear Guy,” smiles Holmes. “Let me tell you about metadata and cataloging.”
Holmes’ brilliant solution for data accuracy
A metadata catalog, Holmes informs Guy, addresses all the benign reasons for inaccurate data.
Metadata is the set of descriptions, definitions, and contextual information about your data. Take a typical column of numerical data, starting with 5672, then 879, then 3427, and continuing for another 2,000 rows. What do those numbers signify? Customer LTV? The house number in a billing address? A product SKU? Unless you know what the numbers signify, they’re useless.
Metadata gives the numbers definition and context (e.g. Revenue per customer in 2021 (USD)). Metadata lets you know the source of the numbers and how they were calculated. Metadata tells you when the numbers were last updated and who updated them.
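As a rough illustration, the metadata for that column could be held in a small record like the one sketched below in Python. Only the description, the unit, and the first three values come from the example above; the field names, source, calculation, date, and editor are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ColumnMetadata:
    """Context that turns a bare column of numbers into usable data."""
    name: str
    description: str
    unit: str
    source: str
    calculation: str
    last_updated: date
    updated_by: str


# First values of the mystery column from the example above.
values = [5672, 879, 3427]

revenue_meta = ColumnMetadata(
    name="revenue_per_customer_2021",                   # hypothetical identifier
    description="Revenue per customer in 2021",
    unit="USD",
    source="billing_db.invoices",                       # hypothetical source
    calculation="sum of invoice totals per customer",   # hypothetical rule
    last_updated=date(2022, 1, 15),                     # hypothetical date
    updated_by="b.i.guy",                               # hypothetical editor
)


def describe(value, meta):
    """Attach the column's definition and unit to a single raw value."""
    return f"{meta.description}: {value} {meta.unit}"


print(describe(values[0], revenue_meta))
```

With the record attached, 5672 stops being a mystery number and becomes a revenue figure with a unit, a source, and an owner.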
But metadata can’t be leveraged unless it’s organized and accessible. Enter the metadata catalog.
What is a metadata catalog?
A metadata catalog (more commonly referred to as a data catalog) is a tool that leverages the metadata about an organization’s entire body of data assets to create entries that concentrate all relevant information about a data asset in one place. Each entry includes definitions, descriptions, ratings, responsible individuals, and more, making it simple to search for and identify the data assets you need for any given purpose.
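A toy version of such a catalog can be sketched in a few lines of Python. This is an assumption-laden illustration, not how any real catalog product works: the entry fields, the sample entries, and the `search` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One catalog entry: everything known about a data asset in one place."""
    name: str
    description: str
    owner: str
    rating: float


# Hypothetical entries (illustrative only).
catalog = [
    CatalogEntry("crm_conversions", "Conversions per customer from the CRM", "sales-ops", 4.5),
    CatalogEntry("web_conversions", "Conversion events from the analytics suite", "marketing", 3.8),
    CatalogEntry("billing_addresses", "Customer billing addresses", "finance", 4.1),
]


def search(entries, term):
    """Return entries whose name or description contains the term (case-insensitive)."""
    needle = term.lower()
    return [
        e for e in entries
        if needle in e.name.lower() or needle in e.description.lower()
    ]


matches = search(catalog, "conversion")
for entry in matches:
    print(entry.name, "-", entry.owner)
```

Searching for "conversion" surfaces both competing assets side by side, along with who is responsible for each, which is exactly the moment a user can ask which definition applies.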
“So why exactly would that improve my data accuracy?” asks Guy.
“Ah,” replies Holmes. “Here’s why:”
Metadata catalog = clearly defined data
That’s the essence of a metadata catalog: defining data assets. With all the information at your fingertips, and easily searchable, there need be no more confusion as to what a table, chart, spreadsheet or any other data asset represents.
Metadata catalog = a coordinated, single source of truth
An automated metadata catalog will update itself on a regular basis by reviewing the metadata throughout the organization’s information landscape. No more asymmetrical data: when a data catalog and metadata management tools keep every system pointed at the same current definitions, consistency becomes far easier to maintain.
A metadata catalog equipped with active metadata management (based on AI) will also monitor data assets for signs of quality or consistency issues, and alert the data owners or managers so they can take action.
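The monitoring idea can be illustrated with something far simpler than AI. The Python sketch below swaps in a fixed staleness rule: any asset not refreshed within 30 days triggers an alert to its owner. The asset names, owners, dates, and threshold are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical catalog state: asset name -> (owner, date of last refresh).
assets = {
    "crm_conversions": ("sales-ops", date(2024, 3, 1)),
    "web_conversions": ("marketing", date(2024, 1, 5)),
}


def stale_alerts(catalog, today, max_age_days=30):
    """Return one alert message per asset not refreshed within max_age_days."""
    limit = timedelta(days=max_age_days)
    alerts = []
    for name, (owner, refreshed) in sorted(catalog.items()):
        age = today - refreshed
        if age > limit:
            alerts.append(f"{owner}: {name} last refreshed {age.days} days ago")
    return alerts


# A fixed date keeps the example reproducible.
alerts = stale_alerts(assets, today=date(2024, 3, 10))
for message in alerts:
    print(message)
```

A real active-metadata tool would watch many more signals than age, but the shape is the same: inspect the metadata, compare it against an expectation, and notify the responsible person.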
Metadata catalog = easily accessible information about data use and accuracy
A high-quality metadata catalog will have usage statistics: what is this data asset used for? What other assets is it used with? How many times has it been used?
If you have all this information at your fingertips, and you have the choice between a data asset that’s been used 10,000 times for reports that are similar to the one you’re creating, versus a data asset that looks comparable but has only been used 75 times, common sense would dictate using the former.
A metadata catalog will often also have user reviews and ratings of each data asset, which are also helpful in evaluating data accuracy.
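That common-sense choice is easy to express in code. In the Python sketch below, the usage counts of 10,000 and 75 come from the example above, while the asset names, the ratings, and the tie-breaking rule (usage first, then rating) are assumptions made for illustration.

```python
# Two comparable assets; names and ratings are hypothetical.
candidates = [
    {"name": "sales_summary_a", "uses": 10_000, "rating": 4.6},
    {"name": "sales_summary_b", "uses": 75, "rating": 4.6},
]


def pick_most_trusted(assets):
    """Prefer the most-used asset; break ties with the higher user rating."""
    return max(assets, key=lambda a: (a["uses"], a["rating"]))


best = pick_most_trusted(candidates)
print(best["name"])
```

Usage and ratings are only proxies for accuracy, so a rule like this is a starting point for a decision rather than a verdict.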
Metadata catalog = concentrated tribal knowledge
A comprehensive metadata catalog solution will provide tools for collaboration among data users, data owners and subject matter experts. Ideally, each catalog entry will have a place where questions can be asked and answers can be preserved for anyone with a similar question in the future. The ability to learn and share information about your data catalog metadata (would that be metametadata?) is key to accurately appraising and using the data.
Case closed
B.I. Guy implemented a metadata catalog and worked happily ever after… until the day when he tried to access a data asset entry and saw that the metadata had disappeared.
But for that, you’ll need to read the best-selling sequel: Sherlock Holmes and the Case of the Missing Metadata.