High-level insights. Low cost.
That’s the data dream, right?
Deciding to migrate your data from legacy, on-prem databases to a cloud-based data warehouse like Snowflake is often a step in the right direction.
But just a step in the right direction won’t get you to the finish line. How you prepare for your Snowflake migration, how you conduct the migration, and how you leverage your new data warehouse post-migration: that’s what will get you to your dreamy data paradise.
Let’s take a look at some concrete steps to set you up for a Snowflake data migration that will deliver better insights at a lower cost to you.
See what your data flow actually looks like; then create a plan for how you want it to look
When you compare legacy database systems to modern ones, it's hard not to be reminded of the difference between unplanned and planned cities.
[Image: aerial photograph of Kolkata, India, shared on Reddit]
Your legacy systems are, more often than not, unplanned data cities. They’ve grown organically over years of use and changing needs. They often contain workarounds or have bits cobbled together. Dead ends sit where you wish there were throughways (data silos, anyone?).
When you are migrating SQL Server to Snowflake, or any other modern database system, you have the opportunity to create a planned data city. Built from scratch and laid out all at once, planned cities have a good shot at combining aesthetics with functionality and wise use of resources (plus, some of them look really beautiful from space). A planned move to a modern database system can increase functionality and decrease operational costs.
But the key to a successful planned city is… wait for it… planning!
The first stage of smart planning is to take stock of exactly what you have and where you are right now. Map it out. How exactly does data flow through your legacy systems? Which pipelines carry data effectively, which are redundant, and which lead to dead ends? Automated data lineage is invaluable for creating this visual map of your current system's innards.
Now you can start dreaming – with both feet on the ground. Where does the current map need to be fixed? Where can it improve? Develop a map of your ideal, streamlined, optimized data flow. Use this map as a Snowflake migration tool to plan your migration and design effective data pipelines for the Snowflake system.
Keep the map to guide you and to measure your migration efforts against. Plans and blueprints are wonderful, but they're only meaningful when reality measures up to them. At any given point in your Snowflake migration, Octopai's automated data lineage solution can provide a detailed, complete visualization of your data flow within Snowflake, including pipes and procedures. And unlike some Snowflake data lineage tools, which only show lineage within Snowflake, Octopai shows end-to-end lineage from data source, through Snowflake, all the way to reporting. Keep your finger on the pulse of your migration so that if it drifts off track, you can correct course quickly.
Do data housekeeping
Years ago, I volunteered at a secondhand charity clothing store that accepted donations from the public. (Warning to potential secondhand charity clothing store owners: if you accept donations, everyone and his dog will drop off their old clothing, no matter what condition it's in.) I – and a group of volunteers – once spent the better part of a day knee-deep in garbage bags full of clothes, evaluating whether each item was actually good enough to put in the store. If it was stained, torn, outdated or otherwise unlikely to sell, we consigned it to the discard pile.
Your spanking new modern data warehouse doesn’t need dirty, antiquated or otherwise irrelevant data.
Discard. Throw out. Dispose of. Downsize. Chuck.
It will take you a significant amount of time to do a comprehensive evaluation of your data environment and decide what is good enough to migrate and what can be left behind. But it’s well worth it. Snowflake charges separately for storage and for compute time. Extraneous data will cost you more on both counts: you’re taking up storage with unhelpful data, and you may end up including it in your queries, lengthening your compute time.
Automated data lineage can make this evaluation process faster and more accurate. By tracing data assets back to their source and forward to analytics and reporting, you will see clearly which data is questionable, unused or redundant.
Constructing such a map manually would take so long – like weeks, but more likely months – that the map would be out of date by the time it was finished. But with automated data lineage you can create the initial map quickly (it may actually only take the better part of a day!) and automatically renew it regularly, providing you with clear, up-to-date direction on exactly what (and what not) to migrate to Snowflake.
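If your legacy source happens to be SQL Server, the engine's own usage statistics can give you a rough first pass at spotting dormant tables. Here is a minimal T-SQL sketch; note that sys.dm_db_index_usage_stats only accumulates data since the last server restart, so treat the output as a starting point for investigation, not a verdict:

```sql
-- Tables with no recorded user reads since the last SQL Server restart.
-- Stats reset on restart, so only trust this on a server that has been
-- up long enough to reflect the real workload.
SELECT
    t.name AS table_name,
    MAX(u.last_user_seek)   AS last_seek,
    MAX(u.last_user_scan)   AS last_scan,
    MAX(u.last_user_update) AS last_update
FROM sys.tables AS t
LEFT JOIN sys.dm_db_index_usage_stats AS u
       ON u.object_id = t.object_id
      AND u.database_id = DB_ID()
GROUP BY t.name
HAVING MAX(u.last_user_seek) IS NULL
   AND MAX(u.last_user_scan) IS NULL
   AND MAX(u.last_user_lookup) IS NULL
ORDER BY table_name;
```

Cross-check candidates like these against your lineage map before discarding anything: a table nobody queried this quarter may still feed a year-end report.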
Take the time to understand your data
Even if you make the initial map lightning-fast (hooray!), the subsequent planning needs to be done with thought and deliberation. Data assets are not old clothes where you can reasonably decide in the space of five seconds whether to keep or chuck them.
While some data assets might be obviously irrelevant and unused, for others you may need to do a little digging to find out who uses them, what they’re used for, who is responsible for these assets and what dependencies exist between them and other assets.
A data catalog, if your organization has one, is invaluable here. Take advantage of all the information already in your catalog to understand any given asset, its importance (or lack thereof) and its position within your data landscape. If the catalog includes ratings, reviews or other data quality information, that can help you decide whether a particular asset is worth migrating. And if an asset doesn't seem to be of sufficient quality to migrate, but processes or pipelines depend on it, the "similar assets" or "related assets" feature in a good data catalog can help you locate a potential substitute.
(If your organization does not yet have a data catalog, a migration is actually a perfect opportunity to create one. In any event, you will be reviewing and working on many data assets, collecting information about each one. Centralizing that information in a data catalog will enable all stakeholders to leverage the insights both during the migration and post-migration.)
Train users in the specifics of how to use Snowflake efficiently
Snowflake’s pricing model of charging separately for compute time can provide flexibility… but can also rack up charges if your data team doesn’t know how to optimize their queries and data warehouse compute settings.
Your data developers should be aware of the patterns they use to resolve SQL differences between your legacy database and Snowflake; some of those patterns produce expensive queries, and developers need to know which to avoid. They should also understand how to set up and scale virtual warehouses effectively, and how to load data in a way that maximizes resource use at the lowest cost – or when it makes more sense to let Snowflake automate those decisions for them.
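To make that concrete, here is a hedged sketch of a warehouse definition with cost-conscious defaults. The name and sizing are illustrative placeholders, not recommendations for every workload (and multi-cluster settings require Snowflake's Enterprise edition):

```sql
-- Illustrative settings; analytics_wh and the sizing are placeholders.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'SMALL'      -- start small; resize only if queries queue
  AUTO_SUSPEND = 60             -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE            -- wake automatically when a query arrives
  MIN_CLUSTER_COUNT = 1         -- multi-cluster scaling (Enterprise edition)
  MAX_CLUSTER_COUNT = 3         -- cap concurrency-driven scale-out
  SCALING_POLICY = 'STANDARD'   -- favor starting clusters over queueing
  INITIALLY_SUSPENDED = TRUE;   -- don't start billing until first use
```

Even if you let Snowflake automate everything else, AUTO_SUSPEND and AUTO_RESUME alone eliminate the most common source of waste: an idle warehouse left running.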
Your data analysts, who will be running most of the queries on Snowflake's platform, need to understand the cost and impact of different types of queries – and which kinds of queries NOT to run. They should also know how to optimize queries, whether by sampling when it meets their needs, or by using an external tool like automated data lineage and discovery to pinpoint the location of the data they need, minimizing exploratory queries.
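As one concrete illustration of the sampling point, Snowflake supports sampling directly in SQL. The table and column names below are hypothetical:

```sql
-- Explore on roughly 1% of rows instead of scanning the whole table.
-- orders, customer_region and order_total are hypothetical names.
SELECT customer_region,
       COUNT(*)         AS sampled_orders,
       AVG(order_total) AS avg_order_total
FROM orders SAMPLE (1)
GROUP BY customer_region;
```

Selecting only the columns you need and filtering early (rather than running SELECT * and eyeballing the results) has a similar effect, since the warehouse time a scan consumes is exactly what you're billed for.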
Give targeted access permissions
Even if you’ve taught everyone to drive, not everyone needs the keys to the car all the time. Being judicious about who you give the keys to will decrease the chances of the system being inappropriately used.
Snowflake enables admins to restrict account access to specific IP addresses and to set network policies on a per-user basis. Grant access only to users who need it now, or in the near future, to do their jobs. And make sure that users with access have undergone query effectiveness training, have actually internalized it, and continue to use Snowflake wisely.
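In Snowflake SQL, that combination of network restrictions and need-based grants looks roughly like the sketch below. The policy name, IP range, role, user, and object names are all placeholders:

```sql
-- Restrict logins to a known network range (values are placeholders).
CREATE NETWORK POLICY office_only
  ALLOWED_IP_LIST = ('203.0.113.0/24');

-- Apply the policy account-wide...
ALTER ACCOUNT SET NETWORK_POLICY = office_only;

-- ...or to a specific user.
ALTER USER jane_analyst SET NETWORK_POLICY = office_only;

-- Grant through roles rather than to individuals, and only what the
-- job requires right now.
CREATE ROLE IF NOT EXISTS reporting_analyst;
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE reporting_analyst;
GRANT USAGE ON DATABASE sales_db TO ROLE reporting_analyst;
GRANT USAGE ON SCHEMA sales_db.public TO ROLE reporting_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE reporting_analyst;
GRANT ROLE reporting_analyst TO USER jane_analyst;
```

Revisit these grants periodically; "near future" has a way of becoming "indefinitely" if no one reviews access.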
Let it snow!
Your data migration to Snowflake won’t be done tomorrow. After all, even planned cities aren’t built in a day. But if you follow the best practices above, it will be done wisely, effectively, and in a manner that will deliver cost-effective data-driven insights for years to come.