Automatic Data Dictionary Mapping Using Machine Learning

Is your data going somewhere?


No, we’re not talking about databases going AWOL or data warehouses getting a little R&R at the beach. We’re talking about migrations from one system to another, or combining data from different systems into a single system or warehouse.


These activities are quite common in business. In order to pull them off without tearing your hair out, you need two crucial elements: data dictionaries for all data assets involved, and data mappings from each source to its target.


Data Dictionaries

What is a data dictionary? A data dictionary is a technical representation of the metadata associated with a data object, such as a database, table, or column. A data dictionary contains information such as:


– Database name, user names of administrators and other users, status (such as read-only)

– Indexes and views defined on the database

– Relationships between tables

– Table names

– Column names, labels, data types, formats


Essentially, everything about the data goes into the data dictionary. Data dictionaries are used mainly for technical work: database designers, database administrators, developers, and BI professionals rely on them when building applications on top of a database or combining data from different sources into reports.
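To make the list above concrete, here is a minimal sketch of how such metadata could be held in code. The class names, field names, and the sample "crm" database are all hypothetical, chosen only for illustration; real data dictionary tools will structure this differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ColumnEntry:
    name: str
    label: str
    data_type: str
    format: str = ""  # e.g. "MM/DD/YYYY" for a date column


@dataclass
class TableEntry:
    name: str
    columns: List[ColumnEntry] = field(default_factory=list)
    indexes: List[str] = field(default_factory=list)


@dataclass
class DataDictionary:
    database: str
    status: str  # e.g. "read-only"
    administrators: List[str] = field(default_factory=list)
    tables: Dict[str, TableEntry] = field(default_factory=dict)
    # Each relationship: (child_table, child_column, parent_table, parent_column)
    relationships: List[Tuple[str, str, str, str]] = field(default_factory=list)

    def column(self, table: str, column: str) -> ColumnEntry:
        """Look up one column's metadata; raises KeyError if it is missing."""
        for entry in self.tables[table].columns:
            if entry.name == column:
                return entry
        raise KeyError(f"{table}.{column}")


# Hypothetical example data, not taken from any real system.
crm = DataDictionary(
    database="crm",
    status="read-only",
    administrators=["dba_team"],
    tables={
        "customers": TableEntry(
            name="customers",
            columns=[
                ColumnEntry("cust_id", "Customer ID", "INTEGER"),
                ColumnEntry("signup_dt", "Signup Date", "DATE", "MM/DD/YYYY"),
            ],
            indexes=["pk_customers"],
        )
    },
)

print(crm.column("customers", "signup_dt"))
```

Keeping the format alongside the data type matters later: it is exactly the piece of metadata a mapping needs in order to convert values correctly.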


In any modern business, data dictionaries are also an important element of data governance. A data dictionary serves as the technical basis of a comprehensive data governance program, because data governance relies on the underlying metadata.


Note the important distinction between a data dictionary and a business glossary, two terms that are often confused. Although the two sound like they should be the same thing, a business glossary is a list of terms and their meanings in the context of a specific enterprise, and it has little to do with the technical aspects of the company’s data assets.



Data Mapping

With data mapping, elements from one data asset are mapped to corresponding elements in another data asset. As alluded to previously, data mapping is useful when you’re integrating different data sources (think mergers and acquisitions) and when you’re migrating data from one data asset to another, for instance, from a legacy system to a new one. In some cases, integrations and migrations happen in the same project.


In either case, having good data dictionaries for all data assets is crucial to the success of the project. It’s essential that the data dictionaries are accurate so that the mappings will be correct.


If all data assets followed the same conventions for column names and used the same data types and formats for similar columns (such as date and numeric columns), data mapping would be trivially easy. Real life, however, isn’t that simple. In practice, data mapping runs into obstacles such as:


– Different names and labels for columns containing the same information, such as names or addresses; this is especially prevalent when data sources are in different languages


– Different representations of the same information—for example, a single “Name” column in one system vs. separate “First Name” and “Last Name” columns in another


– The same format representing different things; for example, the date “09/05/2019” means “September 5, 2019” in the U.S., but “May 9, 2019” in some other countries


…and more. Effective data mapping thus requires not just a simple connection between one column and another, but often transformation rules as well.
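As a rough illustration of what "transformation rules" can mean, the sketch below attaches a small function to each source column. The column names, the `MAPPING` structure, and the assumption that the source system writes dates as day/month/year are all hypothetical. The name-splitting rule is deliberately naive and would need refinement for real names.

```python
from datetime import date, datetime


def split_full_name(full_name: str) -> dict:
    """Naive rule: everything before the last space is the first name."""
    first, _, last = full_name.strip().rpartition(" ")
    return {"first_name": first, "last_name": last}


def parse_date(value: str, source_format: str) -> date:
    """Parse a date string using the format recorded for the source column."""
    return datetime.strptime(value, source_format).date()


# Hypothetical mapping spec: source column -> rule that yields target fields.
# The "%d/%m/%Y" format is an assumption about the source system.
MAPPING = {
    "Name": split_full_name,
    "OrderDate": lambda v: {"order_date": parse_date(v, "%d/%m/%Y").isoformat()},
}


def apply_mapping(row: dict) -> dict:
    """Apply every rule to its source column and merge the results."""
    target = {}
    for source_column, rule in MAPPING.items():
        target.update(rule(row[source_column]))
    return target


row = {"Name": "Ada Lovelace", "OrderDate": "09/05/2019"}
result = apply_mapping(row)
print(result)
# {'first_name': 'Ada', 'last_name': 'Lovelace', 'order_date': '2019-05-09'}
```

Writing the target date in ISO 8601 form (year-month-day) removes the ambiguity for every downstream consumer, whichever convention the source used.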


Automatic Data Dictionary Mapping

The key to overcoming these challenges is the application of automated tools. Automated data mapping tools can take most of the tedious and error-prone manual work out of the process and shorten the cycle time from days or weeks to mere hours.
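To give a feel for what such a tool automates, here is a deliberately simple sketch that proposes column mappings from name similarity alone, using Python's standard `difflib`. This is plain string matching, not machine learning and not any particular product's method. The function names, the 0.6 threshold, and the sample columns are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop separators so 'First Name' and 'first_name' agree."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def suggest_mappings(source_columns, target_columns, threshold=0.6):
    """Return {source: (best_target, score)}.

    best_target is None when no candidate reaches the threshold, which
    flags the column for a human to review.
    """
    suggestions = {}
    for source in source_columns:
        scored = [
            (SequenceMatcher(None, normalize(source), normalize(target)).ratio(), target)
            for target in target_columns
        ]
        score, best = max(scored)
        suggestions[source] = (best if score >= threshold else None, round(score, 2))
    return suggestions


# Hypothetical column names from two systems.
suggestions = suggest_mappings(
    ["First Name", "cust_email", "zip"],
    ["first_name", "customer_email", "created_at"],
)
for source, (target, score) in suggestions.items():
    print(f"{source!r} -> {target!r} (score {score})")
```

A name-only matcher like this breaks down when the two systems use unrelated names or different languages, which is one reason real tools also draw on data types, formats, and the values themselves.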


Modern data mapping tools are more intelligent than ever. Automatic data dictionary mapping uses machine learning to make connections that are difficult for humans to catch, such as the ambiguous “09/05/2019” date above. Machine learning, the same technology that powers automatic fraud detection for banks and machine vision in robots, makes the data mapping process even easier, leaving humans to resolve only the handful of truly ambiguous cases.
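The “09/05/2019” case shows why looking at the data itself helps: a single value is ambiguous, but a whole column usually is not. The sketch below is a hand-written rule rather than a learned model, and the function name and return labels are hypothetical. It only illustrates the kind of evidence an automated tool can pick up from column values, assuming dates written as two numbers and a year separated by slashes.

```python
def infer_date_order(values):
    """Return 'MDY', 'DMY', or 'ambiguous' for strings shaped like NN/NN/YYYY.

    A component greater than 12 cannot be a month, so it must be the day.
    """
    first_over_12 = second_over_12 = False
    for value in values:
        first, second, _year = (int(part) for part in value.split("/"))
        first_over_12 = first_over_12 or first > 12
        second_over_12 = second_over_12 or second > 12
    if first_over_12 and not second_over_12:
        return "DMY"
    if second_over_12 and not first_over_12:
        return "MDY"
    # Either no value settles the question, or the column mixes conventions.
    return "ambiguous"


print(infer_date_order(["09/05/2019"]))                # ambiguous
print(infer_date_order(["09/05/2019", "25/05/2019"]))  # DMY
print(infer_date_order(["09/05/2019", "09/25/2019"]))  # MDY
```

Columns that come back as "ambiguous" are the ones worth routing to a person, which matches the division of labor described above.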


Data mapping is too important to perform manually. Manual data mapping takes too long, especially for complex migrations and integrations, and the tedium can lead to burnout among the team members. With manual data mapping, it’s far too easy to miss important connections or define incorrect transformations. Applying automated data mapping tools can take the frustration out of the process and help see your project to completion.



