4 Common Data Integrity Issues and How to Solve Them


Seen on an online dating profile:


Deep, thought-out analyst type searching for a partner with integrity. A significant other who is completely reliable. A soulmate who can help accurately assess situations and make the right choices in life’s big decisions. 


Or maybe that should be an “online data-ing profile”?


Seeking: data with integrity

Integrity isn’t just a coveted trait for the partner of your dreams. It’s also a critical trait for the data assets of your dreams. 


What is data with integrity?


Data integrity is the extent to which you can rely on a given set of data for use in decision-making. Increased data quality, accessibility, alignment across systems, and context all contribute to increased data integrity.


Where can data integrity fall short? And how can you fix or – even better – prevent these issues from occurring? 


Let’s see what it takes for you and your data with integrity to live happily ever after.


Too much or too little access to data systems

Call it the golden mean, or call it a catch-22: data that is too available will reduce data integrity, but so will data that is not available enough.


Why?


Remember, the definition of data integrity is the extent to which you can rely on a given set of data for use in decision-making. If too many hands can stir the pot of your data, intentional or accidental corruption becomes much more likely. Result? Unreliable data driving your decisions. 



On the other hand, if your data systems are as guarded as Fort Knox, your data might be the gold standard in reliability for decisions… except no decision maker can actually get to it in order to rely on it. 


The FDA’s public database of warning letters to organizations under its supervision contains many instructive examples of organizations that stray (wildly) from the golden mean.


In one example: “Our investigator found “System/Administrator” as the only user role for your (b)(4) software. There were no restrictions on deleting or modifying data for this user role. (b)(4) was used for assay and identity testing of finished thyroid, USP API from March 2021 to November 2021.”


(By the way, for those of you who have never experienced the joy of perusing FDA warning letters, (b)(4) is used to replace trade secret or confidential business information and (b)(6) is used to replace personnel or medical files information.)


Or, in the same letter: “Manufacturing master batch records held in electronic form on your company’s shared drive do not have restrictions on user access. Your quality unit personnel stated that there are no restrictions for any personnel with login credentials to access new and obsolete master records. Our investigator observed during the inspection multiple versions of batch records were utilized for API lot production.”


Would you want to rely on whatever thyroid treatment that lab is turning out? 


I think not. 


What’s the solution for data integrity access issues?

Data integrity best practices dictate data systems with sufficient controls to prevent unauthorized access or changes to data. 


At the same time, data should be discoverable, searchable and accessible to anyone who truly needs it for their organizational role. Each role should have exactly the access level it needs: no more, no less.


Optimizing data access for maximal data integrity requires a data governance solution that allows for fine-tuning access permissions, as well as a data management solution including comprehensive, automated data cataloging and data discovery.
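To make "no more, no less" a bit more concrete, here is a minimal sketch of a role-based permission check. The role names, the actions and the `is_allowed` helper are all hypothetical, invented for illustration; a real data governance tool would manage this for you.

```python
# Hypothetical roles and the actions each one needs.
# Nothing here comes from a real product.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "data_steward": {"read", "update"},
    "administrator": {"read", "update", "delete"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True only if the role was explicitly granted the action."""
    # Unknown roles get an empty permission set, so access is denied by default.
    return action in ROLE_PERMISSIONS.get(role, set())


print("analyst may read:", is_allowed("analyst", "read"))
print("analyst may delete:", is_allowed("analyst", "delete"))
```

The deny-by-default lookup is the point: a role gets only what was explicitly granted, which is the opposite of the single "System/Administrator" role in the warning letter above.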


With these data integrity standards, when there’s a decision to be made, both the data’s level of reliability and level of access will facilitate solid, sound decisions.


Inconsistent data definitions

Anyone composing an online dating profile knows (or should know) that there are certain terms to shy away from, because they are way too conducive to multiple interpretations.


“Spiritual,” for instance. Does that mean you’re into Bible study? Ouija boards? Or that you spend hours in the woods communing with nature? 



Does looking for someone “accepting” mean you want someone who will appreciate your multiracial family – or someone who won’t get down on you for your bad habits? 


If you’re going to use terms like this, make sure to specify exactly what interpretation you intend, so you and your reader are on the same page.


Data assets and processes often suffer from similar inconsistency issues. If one source of data uses one definition or calculation for “Customer Lifetime Value,” and another source uses a different definition or calculation, but still calls it “Customer Lifetime Value,” the inconsistency will reduce your ability to understand your enterprise’s situation and make effective decisions.
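A quick sketch shows how far apart two numbers with the same label can land. Both formulas below are hypothetical stand-ins for what two different source systems might call "Customer Lifetime Value," and the customer figures are made up.

```python
# Two hypothetical definitions that both get labeled "Customer Lifetime Value".

def clv_revenue_based(avg_order_value: float, orders_per_year: float, years: float) -> float:
    """Definition A (assumed): total expected revenue over the relationship."""
    return avg_order_value * orders_per_year * years


def clv_margin_based(avg_order_value: float, orders_per_year: float, years: float,
                     gross_margin: float) -> float:
    """Definition B (assumed): expected revenue scaled down to gross profit."""
    return clv_revenue_based(avg_order_value, orders_per_year, years) * gross_margin


# Same made-up customer, run through both definitions.
clv_a = clv_revenue_based(avg_order_value=100.0, orders_per_year=4, years=5)
clv_b = clv_margin_based(avg_order_value=100.0, orders_per_year=4, years=5,
                         gross_margin=0.25)
print(f"Source A reports CLV = {clv_a:.2f}")
print(f"Source B reports CLV = {clv_b:.2f}")
```

With these inputs, source A reports four times the value source B does for the very same customer, and neither one is "wrong" by its own definition.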


Organizations that have grown by mergers and acquisitions are particularly susceptible to this cause of reduced data integrity.  


What’s the solution to data integrity consistency issues?

Get everyone in your organization on the same page when it comes to understanding and defining your data. 



Tools like a data catalog keep everyone in the enterprise coordinated on every data asset: the same business definition, the same technical info, the same data owner and steward. It’s the one-stop shop where everyone in your enterprise shops. And when the data catalog is automated, it updates itself, so it is always current.


Other data management tools and policies can enhance the way your data is processed, stored, combined and used – the very things that impact data consistency as it travels through your data landscape.


And if you can’t avoid a situation where two different departments have two different definitions of “Customer” or “Product,” at least be explicit about the interrelationships: where the definitions overlap, where they differ, and how you can get them to play nicely together.


Incomplete data records

Lack of completeness can occur in one of two ways:

  • Entire records can go missing
  • Some records are more complete than others


Entire records going AWOL is the more obvious problem. Take this FDA warning letter as a rather extreme example:


“Your QU failed to ensure good documentation practices in your facility as evidenced by:

• A Batch Manufacturing Production logbook was found with pages torn out

• Destruction of (b)(4) API batch record for batch (b)(4) that contained manufacturing and analysis records, despite this batch remaining in U.S. distribution.”


Even when all records are extant, some may hold more data than others, and that can cause its own data integrity problems. Take a CRM that combines multiple data sources to produce one record. Because of the disparity in data sources, Anne’s record has her name, email, phone number, address and profession, whereas Bob’s record has only his name and email. This lack of completeness in the combined data asset can skew any sales, marketing or customer support decisions based on it.


What’s the solution to data integrity completeness issues?

First off, do not keep your only copy of your records in a logbook capable of having pages torn out. 


That will fail any data integrity assessment with flying colors.



Assuring the data’s physical integrity – the ability to store and retrieve data in its complete and accurate form despite any force acting on it – is the foundation of data integrity. 


Establish a framework and policies to support the physical integrity of your information, including:

  • Data backup, recovery and restoration measures
  • Protection against bugs, hackers and other cybersecurity threats
  • Data security education and training for your teams
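One small piece of a backup policy is checking that a backup really matches the original. The sketch below compares SHA-256 checksums using Python's standard library; the file names and the `backup_matches` helper are hypothetical, and a real backup tool would typically handle this verification itself.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, backup: Path) -> bool:
    """Return True if the backup is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(backup)


# Demo with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "batch_record.csv"
    backup = Path(tmp) / "batch_record.bak"
    original.write_text("batch,result\n42,pass\n")
    shutil.copyfile(original, backup)
    intact = backup_matches(original, backup)

    backup.write_text("batch,result\n42,fail\n")  # simulate a corrupted backup
    corrupted_detected = not backup_matches(original, backup)

print("backup intact after copy:", intact)
print("corruption detected:", corrupted_detected)
```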


In the case of incompleteness within a data asset, like the above case of the CRM records, being aware of the issue is the first and most important step. Sometimes the data can be made more complete by enrichment with external sources. Even if that’s not a possibility, the awareness of the data’s incompleteness should help to reduce skewed decision-making. 
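Becoming aware of the issue can start with something as simple as measuring how often each field is filled in. Here is a minimal sketch using made-up records for Anne and Bob; the field list and the `field_completeness` helper are assumptions for illustration only.

```python
# Hypothetical CRM records merged from several sources.
# A missing key means no source ever supplied that value.
FIELDS = ["name", "email", "phone", "address", "profession"]

records = [
    {"name": "Anne", "email": "anne@example.com", "phone": "555-0100",
     "address": "1 Main St", "profession": "Engineer"},
    {"name": "Bob", "email": "bob@example.com"},
]


def field_completeness(rows: list[dict], fields: list[str]) -> dict[str, float]:
    """Return, per field, the fraction of rows that have a non-empty value."""
    if not rows:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for row in rows if row.get(field)) / len(rows)
        for field in fields
    }


completeness = field_completeness(records, FIELDS)
for field, share in completeness.items():
    print(f"{field}: {share:.0%} populated")
```

A report like this won't fill in Bob's phone number, but it tells anyone segmenting by profession or address that half the records have nothing to contribute.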


Inaccurate data

Wrong names. Outdated numbers. Dirty data. Ick.


If the dating site profile of Adriana Travis, age 26, actually belongs to Jane Brown, age 45, you can bet that some interested suitors are going to be pretty disappointed should they go all the way to meeting her in person. 



Much inaccurate data is due to human influence on data assets. Most of that influence is unintentional, like mistakes in data entry, but malicious actors like Adriana… err… Jane can play a part as well. 


At other times, errors are due to the automated processes through which the data is transferred or transformed. 


However it happens, the chances of making the right decisions are significantly reduced when they are based on wrong data.


What’s the solution to data integrity accuracy issues?

Step number one, especially if some of your data sources rely on manual data entry, is training for relevant personnel on data entry best practices and how to avoid data entry mistakes. It may sound superfluous, but it’s not.


Once you’ve addressed the human side, data integrity controls and constraints protect database integrity and information integrity wherever data is stored or viewed in a structured way. As data is loaded into the database or processed for an integration flow, it is checked against a set of rules. For example, a field that is supposed to hold “ZIP CODE” will only accept a string of five or nine digits. Anything else (e.g. a fraction, a string of letters, a string of four digits) will invalidate the entry. 
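As a rough illustration, the ZIP code rule could be expressed as a regular expression check. The `is_valid_zip` name is hypothetical, and accepting the hyphenated ZIP+4 form (12345-6789) alongside nine digits run together is an assumption; your own rule may be stricter.

```python
import re

# Five digits, optionally followed by four more, with or without a hyphen.
# Accepting the hyphenated ZIP+4 form is an assumption of this sketch.
ZIP_PATTERN = re.compile(r"[0-9]{5}(?:-?[0-9]{4})?")


def is_valid_zip(value: str) -> bool:
    """Return True if the whole value looks like a 5- or 9-digit ZIP code."""
    return ZIP_PATTERN.fullmatch(value) is not None


for candidate in ["12345", "12345-6789", "123456789", "1234", "abcde", "0.5"]:
    verdict = "accepted" if is_valid_zip(candidate) else "rejected"
    print(f"{candidate!r}: {verdict}")
```

Using `fullmatch` rather than `search` matters here: it rejects values that merely contain a valid ZIP code somewhere inside a longer string.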


Another type of data integrity control is the use of unique primary keys to identify and retrieve individual records. Give each customer entity in your CRM a unique customer ID, and require that ID wherever records relating to the customer appear in your data systems. This does wonders to eliminate or reduce data duplication and entity confusion.
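Here is a minimal sketch of that idea using Python's built-in `sqlite3` module and an in-memory database. The table and column names are hypothetical; the point is only that the database itself refuses a duplicate customer ID and an order that points at a customer who doesn't exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(customer_id))"
)

conn.execute("INSERT INTO customers (customer_id, name) VALUES (1, 'Anne')")
conn.execute("INSERT INTO orders (order_id, customer_id) VALUES (100, 1)")


def try_insert(sql: str) -> bool:
    """Run one INSERT; return False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# A second customer with ID 1 violates the primary key.
duplicate_ok = try_insert("INSERT INTO customers (customer_id, name) VALUES (1, 'Bob')")
# An order for customer 999 violates the foreign key.
orphan_ok = try_insert("INSERT INTO orders (order_id, customer_id) VALUES (101, 999)")

print("duplicate customer ID accepted:", duplicate_ok)
print("order for unknown customer accepted:", orphan_ok)
```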


These constraints either filter out any data that doesn’t meet the requirements, preventing it from ever making it into the data storage or view, or flag it for a human to check and correct. 



In addition to the controls and constraints that ideally keep inaccurate data from contaminating your data environment, tools like data lineage solutions enable your data team to retroactively assess the validity and accuracy of your data. 


A data lineage solution will show the entire path of any data point from its point of origin in your data landscape through every system it has traversed, along with every transformation and integration that happened to it along the way. If an error occurred somewhere, data lineage helps you quickly find it and fix it at the root. 
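Under the hood, tracing lineage amounts to walking a dependency graph backwards. The toy sketch below is not how any particular lineage product works; the asset names and the `upstream_assets` helper are invented to show the idea of listing everything a suspect report depends on.

```python
# Hypothetical lineage map: each asset lists the assets it is directly built from.
LINEAGE = {
    "sales_dashboard": ["sales_mart"],
    "sales_mart": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["crm_export"],
    "orders_raw": [],
    "crm_export": [],
}


def upstream_assets(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Return every asset that feeds into `asset`, nearest sources first."""
    seen: set[str] = set()
    ordered: list[str] = []
    queue = list(lineage.get(asset, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # skip assets already visited via another path
        seen.add(current)
        ordered.append(current)
        queue.extend(lineage.get(current, []))
    return ordered


print(upstream_assets("sales_dashboard", LINEAGE))
```

If the dashboard shows a wrong number, this list is the set of places to check, starting with the closest one.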


Congratulations

You were seeking data with integrity. You put in the requisite effort – through tools, policies, controls and education – to find it. 


May you analyze happily ever after.

