Data-driven organizations are a bad idea.
Okay, now that you’re wondering whether you had too little sleep last night and just misread that statement, or whether WE had too little sleep when we wrote it, let’s explain.
Using data to drive your organization is wonderful. But data, at best, can only be a powerful vehicle or a reliable GPS system. You need someone behind the wheel doing the driving to actually get anywhere!
Well, that’s obvious, you say.
And it should be. Except sometimes we call organizations “data-driven” when really the data is driving them up the wall. They’re managing the data; they’re governing the data; they’re so busy being on top of their data that they’re not actually using their data. The data is driving them, but it’s not taking them anywhere.
How can that situation be fixed? Enter DataOps.
How can DataOps help here?
The goal of the DataOps methodology is to enable Data to drive Operations: to have a direct, constructive impact on the business operations of the company.
DataOps recognizes that the purpose of having data is to use it. All data management, governance, provenance and what-have-you are just means to that end.
So what is DataOps? DataOps is a set of data management best practices that increases the number of data analytics products a data team can develop and deploy in a given time while dramatically improving data quality. Data analytics teams take the perspective that they are delivering data analytics products – not just services – to data consumers. Products should be ready to consume, easily accessible and responsive to consumers’ needs.
When DataOps principles are implemented within an organization, you see an increase in collaboration, experimentation, deployment speed and data quality.
What DataOps best practices put you on track to achieving this ideal? Let’s take a look.
Six DataOps best practices
Continuous pipeline monitoring with SPC (statistical process control)
This DataOps best practice borrows straight from lean manufacturing. SPC is the continuous testing of the results of automated manufacturing processes. Results (i.e. products or product components) are checked to make sure that they do not deviate in a statistically significant way from the expected results. For example, if your production line is supposed to produce car tires that are 16 inches in diameter, and your SPC tools detect a statistically significant number of tires that are 18 inches in diameter, an alert should go out right away to your engineers, along with automated policy safeguards like pausing production.
SPC tests can do the same thing for the data flowing through your pipelines. Continuous DataOps metrics testing checks data’s validity, completeness and integrity at input and output. Input testing is especially important when your pipelines use data from an external supplier. Business logic is also monitored. If there is any statistically significant deviation from what is expected, an alert goes out to your data engineering team, and suspicious data is prevented from populating analytics products until it is investigated.
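To make this concrete, here’s a minimal sketch of a Shewhart-style control check in Python. The monitored metric (daily row counts), the three-standard-deviation limits and the alert behavior are all illustrative assumptions; a real pipeline would wire this into its monitoring and quarantine tooling.

```python
import statistics

def spc_check(history: list[float], new_value: float, sigmas: float = 3.0) -> bool:
    """Return True if new_value falls inside control limits derived from history.

    A minimal Shewhart-style check: compare the new observation against the
    historical mean plus/minus `sigmas` standard deviations.
    """
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return mean - sigmas * stdev <= new_value <= mean + sigmas * stdev

# Hypothetical metric: row counts from the last 30 daily pipeline runs.
daily_row_counts = [10_230, 10_310, 10_190, 10_280] * 7 + [10_250, 10_300]
todays_count = 4_112  # today's suspiciously small load

if not spc_check(daily_row_counts, todays_count):
    # In production this would page the data engineering team and quarantine
    # the batch instead of printing.
    print("ALERT: row count deviates significantly from the expected range")
```

The same check works for any pipeline metric you can count or measure: null rates, duplicate keys, distribution of a business-critical column, and so on.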
A decentralized, flexible data processing infrastructure
Lean manufacturing is also the source of this DataOps architecture best practice. The lean manufacturing core principle of “Just-in-Time” directs manufacturing processes to produce only what is needed for the next process in a continuous flow. Otherwise, you end up wasting time, materials, manpower and machinery. “Just-in-Time” manufacturing increases production while optimizing resources.
The application of “Just-in-Time” to a DataOps pipeline is well expressed by DataOps originator Lenny Liebmann: your infrastructure needs to match your workload. Processing jobs waiting in queues, servers kept online just in case they’re needed… all of this is a workload-infrastructure mismatch that DataOps aims to eliminate.
DataOps architecture is usually best realized in cloud environments, especially cloud-native environments. The flexible, automated ability to scale up and down, using only the processing resources needed at any given moment, maximizes productivity while minimizing waste.
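As a toy illustration of that workload-infrastructure match, the sketch below sizes a worker pool to the current job queue instead of keeping idle servers around. The throughput constant is an assumption, and in practice this logic would live in your orchestrator or cloud autoscaler rather than in application code.

```python
# A minimal sketch of "just-in-time" capacity matching: workers scale to the
# queue, not the other way around. JOBS_PER_WORKER is an assumed throughput
# figure, not a real benchmark.

JOBS_PER_WORKER = 10  # assumed jobs one worker clears per scaling interval

def rightsize_workers(queue_depth: int, min_workers: int = 0,
                      max_workers: int = 50) -> int:
    """Return the number of workers the current backlog actually needs."""
    needed = -(-queue_depth // JOBS_PER_WORKER)  # ceiling division
    return max(min_workers, min(max_workers, needed))

# An empty queue needs zero workers; 35 queued jobs need 4.
assert rightsize_workers(0) == 0
assert rightsize_workers(35) == 4
```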
Automated workflows for data product creation, testing and deployment
Did you just have a spectacular new idea for a data analytics product? Did your data consumer just tell you about a visualization she could really use for an upcoming critical business decision?
The DataOps way of publishing new products or updates is called continuous delivery, and it relies heavily on automated processes to do all the grunt work. Need a sandbox environment that’s as close to the production environment as possible? Click a button and an automated process does all the heavy lifting to create it for you. Think your product is as good as it’s going to get (until the next update, anyway)? Click a button and a whole battery of automated tests springs into action, testing for bugs, coding errors, misconfigurations, potential conflicts, vulnerabilities and more. Did everything check out? Your new analytics product is automatically released into the wild.
The automated workflows of a DataOps strategy make it possible for you to get from concept to runtime quickly. Can you say “same-hour delivery”?
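In skeletal form, such a deployment gate might look like the sketch below. Every check function here is a hypothetical stub for your actual tooling (dbt tests, a Great Expectations suite, a vulnerability scanner and so on); the pattern is what matters: one call runs the whole battery, and deployment proceeds only on all-green.

```python
# Hypothetical check stubs; in practice each would invoke real test tooling
# (dbt tests, Great Expectations, a security scanner, ...).
def validate_schema(product: str) -> bool: return True
def run_quality_checks(product: str) -> bool: return True
def check_conflicts(product: str) -> bool: return True
def scan_vulnerabilities(product: str) -> bool: return True

def promote_to_production(product: str) -> None:
    print(f"{product} released into the wild")  # stand-in for a real deploy

def run_test_battery(product: str) -> dict[str, bool]:
    """Run every automated check and map each check name to its result."""
    return {
        "schema_validation": validate_schema(product),
        "data_quality_suite": run_quality_checks(product),
        "dependency_conflicts": check_conflicts(product),
        "security_scan": scan_vulnerabilities(product),
    }

def deploy_if_green(product: str) -> None:
    """The 'click a button' step: test everything, deploy only on all-green."""
    failures = [name for name, ok in run_test_battery(product).items() if not ok]
    if failures:
        raise RuntimeError(f"Deployment blocked; failed checks: {failures}")
    promote_to_production(product)

deploy_if_green("q3_revenue_dashboard")
```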
Reusable data product components and standardization
If you want to optimize your car production, you should avoid having to reinvent the wheel (literally) each time you make a car. A wheel should be a standardized part that you don’t have to think twice about before you incorporate it into a new car model.
The same goes for data analytics products.
DataOps-oriented development tools enable data product components and processes to be saved, shared and reused by anyone in your organization. If you’ve created a useful reporting process or dashboard segment, no one need ever waste time reinventing it.
A storehouse of reusable components also goes a long way toward standardizing your organization’s data products and ensuring that they play nicely with each other. (Because resolving conflicts between data products and processes is not an optimal way to use valuable development time. Much better to avoid the conflict in the first place.)
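Here’s one minimal way such a storehouse could work: a shared registry where a component is published once and looked up by name. The decorator and the component itself are illustrative, not any particular tool’s API.

```python
from typing import Callable

# A shared registry of reusable pipeline components (illustrative sketch).
COMPONENT_REGISTRY: dict[str, Callable] = {}

def component(name: str):
    """Decorator that publishes a function as a reusable, named component."""
    def register(fn: Callable) -> Callable:
        COMPONENT_REGISTRY[name] = fn
        return fn
    return register

@component("normalize_currency")
def normalize_currency(amount: float, rate: float) -> float:
    """Convert an amount into the reporting currency at the given rate."""
    return round(amount * rate, 2)

# Any team can now reuse the standardized component instead of rewriting it.
convert = COMPONENT_REGISTRY["normalize_currency"]
print(convert(100.0, 0.92))  # -> 92.0
```

Because every team pulls the same component, the conversion logic (and any fix to it) is standardized across all the data products that use it.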
Easy-to-use collaboration and feedback mechanisms
If the end goal of your data analytics development is for your consumers to use the data to make decisions that promote your organization’s growth (a tenet of the DataOps approach), then it’s going to be very important to collaborate closely with your consumers. What do they need? What do they not need? What do they love in your latest delivery? What do they hate? What is the one tweak that would make that good data analytics product into a great one?
Effective collaboration and feedback require forums and formats that are easy to use. One of the most natural forums for collaboration is wherever data users go to find your data analytics products – commonly a data catalog or a marketplace. Look for a data catalog or marketplace platform that enables collaboration, communication and feedback right there on the data product entry.
Comprehensive metadata that supports data product and process organization
How do users find the data products that they need within data catalogs, marketplaces or any other repository? Through a search function! This is true whether they are business users looking for finished data analytics products or development users looking for reusable components for their own projects.
The identification and categorization that enables effective search is based on metadata. If your metadata management is thorough, and you are using a marketplace or data catalog solution that can leverage it, your search function will empower your users and enable them to quickly find the best products and processes for the job at hand.
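As a toy illustration, here is a catalog search driven purely by metadata. The field names and the simple tag-and-description matching are assumptions made for the sketch; a real catalog or marketplace indexes far richer metadata.

```python
# An illustrative metadata-driven catalog search. Field names are made up
# for the sketch, not a real catalog schema.
catalog = [
    {"name": "monthly_churn_dashboard",
     "owner": "analytics",
     "tags": {"churn", "retention", "dashboard"},
     "description": "Monthly customer churn rates by segment."},
    {"name": "raw_billing_events",
     "owner": "data-eng",
     "tags": {"billing", "raw", "events"},
     "description": "Unprocessed billing event stream."},
]

def search(query: str) -> list[str]:
    """Return product names whose tags or description match the query."""
    q = query.lower()
    return [p["name"] for p in catalog
            if q in p["tags"] or q in p["description"].lower()]

print(search("churn"))  # -> ['monthly_churn_dashboard']
```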
Smooth self-service data access is critical to both developer and user productivity. After all, if the user can’t find what he’s looking for by himself, who is he going to go bother… err… ask?
‘Nuff said.
Don’t just manage data. Use it!
Don’t mistake being in control of your data for being truly data-driven.
If you can get a powerful wild horse to stand relatively calmly and not do anything dangerous or unexpected, you are managing it. Congratulations. That’s certainly much better than hanging on for dear life to a bucking bronco.
But a horse-drawn carriage is only worthy of its name when you can get those horses to drive you to a destination of your choosing.
Be a data-driven organization worthy of the name. Go implement the harness and reins of DataOps, and your data will take you far.