Data warehouse optimization
Learn data warehouse optimization techniques to help reduce cloud spend and improve performance at scale.
What is data warehouse optimization?
As businesses migrate to the cloud and establish data warehouses to manage their data at scale, optimizing these warehouses becomes a highly effective way to reduce cloud spend and improve performance.
Data warehouse optimization is the continuous process of improving the performance, efficiency and effectiveness of a business’s data warehouse. This includes adjusting warehouse sizing, optimizing query performance and managing resource utilization to balance cost and performance. An optimized warehouse can lead to efficient provisioning of computing capacity, helping businesses minimize waste while improving the overall performance of their systems.
Techniques for optimizing data warehouses
One of the major challenges Capital One faced when moving to the cloud was proactively right-sizing data warehouses without pulling our data teams away from driving business value in order to manage warehouses. We came away with the following best practices for managing our warehouses in the cloud.
1. One warehouse size does not fit all
By scheduling your warehouse to change sizes dynamically based on workload size, time of day and day of the week, you pay only for what your company's usage patterns actually require.
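As a rough illustration, a scheduled job can resize a warehouse as usage patterns shift. The sketch below assumes the Snowflake Python connector, credentials supplied through environment variables and a hypothetical warehouse named REPORTING_WH; the sizing policy itself is only an example.

```python
# Sketch: resize a warehouse on a schedule so weekday business hours get a
# larger size than nights and weekends. Warehouse name, sizes and the policy
# are illustrative; credentials are read from the environment.
import datetime
import os

import snowflake.connector

WAREHOUSE = "REPORTING_WH"  # hypothetical warehouse name


def target_size(now: datetime.datetime) -> str:
    """Pick a size based on day of week and hour (simple example policy)."""
    business_hours = now.weekday() < 5 and 8 <= now.hour < 18
    return "LARGE" if business_hours else "XSMALL"


def resize_warehouse() -> None:
    size = target_size(datetime.datetime.now())
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
    )
    try:
        conn.cursor().execute(
            f"ALTER WAREHOUSE {WAREHOUSE} SET WAREHOUSE_SIZE = '{size}'"
        )
    finally:
        conn.close()


if __name__ == "__main__":
    resize_warehouse()  # invoke from a scheduler, e.g. cron or a Snowflake task
```

The same ALTER WAREHOUSE statement can be issued by whatever scheduler your team already uses; the point is that size becomes a function of the usage pattern rather than a fixed default.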
2. Set the right auto-suspend time
Reducing the auto-suspend time for a warehouse from 15 minutes to 2 minutes, for example, can greatly decrease idle time and save on costs. Paying attention to warehouse usage trends will allow you to set an auto-suspend time that improves both utilization and cost efficiency.
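If you are on Snowflake, the auto-suspend window is a single warehouse property. The sketch below, again using the Python connector with a hypothetical warehouse name and environment-variable credentials, sets it to 120 seconds and enables auto-resume.

```python
# Sketch: lower AUTO_SUSPEND from 15 minutes to 2 minutes on a hypothetical
# warehouse so idle compute shuts down sooner.
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
try:
    # AUTO_SUSPEND is specified in seconds; AUTO_RESUME brings the warehouse
    # back automatically when the next query arrives.
    conn.cursor().execute(
        "ALTER WAREHOUSE REPORTING_WH SET AUTO_SUSPEND = 120 AUTO_RESUME = TRUE"
    )
finally:
    conn.close()
```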
3. Configure multi-cluster correctly
At Capital One, we adopted Snowflake as our data cloud because of its near-infinite scalability. When configuring a multi-cluster warehouse, you need to consider how fast a response time is necessary. Significant cost savings are possible if you can tolerate a few minutes of high load before spinning up a new cluster.
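For multi-cluster warehouses (an Enterprise edition feature), the minimum and maximum cluster counts and the scaling policy control that tradeoff. A minimal sketch with illustrative values: the ECONOMY policy favors keeping clusters fully loaded and queueing briefly, while STANDARD adds clusters more aggressively to minimize queuing.

```python
# Sketch: configure a multi-cluster warehouse with the ECONOMY scaling policy,
# which waits for sustained load before spinning up another cluster. The
# warehouse name and cluster counts are illustrative.
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
try:
    conn.cursor().execute(
        """
        ALTER WAREHOUSE BI_WH SET
          MIN_CLUSTER_COUNT = 1
          MAX_CLUSTER_COUNT = 3
          SCALING_POLICY = 'ECONOMY'  -- STANDARD adds clusters more aggressively
        """
    )
finally:
    conn.close()
```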
4. Keep different sizes and types of workloads separate
Each team or group has its own requirements, and the data warehouse should be right-sized for that team's needs. Keeping workloads of different sizes and types separate leads to greater efficiency and cost savings.
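One way to put this into practice is to provision a separate, appropriately sized warehouse per workload type rather than sharing one oversized warehouse. The sketch below uses hypothetical workload names and sizes; the point is the mapping, not the specific values.

```python
# Sketch: create one warehouse per workload type, each sized to its needs.
# Names and sizes are illustrative.
import os

import snowflake.connector

WORKLOADS = {
    "ETL_WH": "LARGE",     # heavy batch loads
    "BI_WH": "MEDIUM",     # interactive dashboards
    "ADHOC_WH": "XSMALL",  # occasional exploratory queries
}

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
try:
    cur = conn.cursor()
    for name, size in WORKLOADS.items():
        cur.execute(
            f"CREATE WAREHOUSE IF NOT EXISTS {name} "
            f"WITH WAREHOUSE_SIZE = '{size}' AUTO_SUSPEND = 120 AUTO_RESUME = TRUE"
        )
finally:
    conn.close()
```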
How to right-size your warehouses
We covered how warehouse size, scaling policy and auto-suspend time affect the cost of running workloads. But how do we determine the right size for a warehouse? There are four key metrics to consider:
1. Query size
Understanding the distribution of typical query sizes that run in a warehouse will help determine the best size. You may find almost all queries are small and you do not actually need that XL warehouse you created by default.
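If your warehouse runs on Snowflake, one way to see that distribution is to query the ACCOUNT_USAGE.QUERY_HISTORY view (which can lag real time by a few hours). A sketch, assuming the Python connector and environment-variable credentials:

```python
# Sketch: distribution of query elapsed times per warehouse over the last week.
import os

import snowflake.connector

SQL = """
SELECT
    warehouse_name,
    APPROX_PERCENTILE(total_elapsed_time / 1000, 0.5) AS p50_seconds,
    APPROX_PERCENTILE(total_elapsed_time / 1000, 0.95) AS p95_seconds,
    COUNT(*) AS query_count
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND warehouse_name IS NOT NULL
GROUP BY warehouse_name
ORDER BY query_count DESC
"""

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
try:
    for wh, p50, p95, n in conn.cursor().execute(SQL):
        print(f"{wh}: p50={p50:.1f}s p95={p95:.1f}s over {n} queries")
finally:
    conn.close()
```

If the 95th percentile is still small on a large warehouse, that is a strong hint the warehouse can be sized down.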
2. Data spillage
When a data warehouse runs out of memory while executing a query, the data spillage that occurs significantly slows down query execution and degrades performance. Viewing the spillage patterns can help you identify queries that tend to lead to spillage and how much disk storage is necessary to complete the queries.
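On Snowflake, spillage shows up as bytes spilled to local or remote storage in QUERY_HISTORY. A sketch that surfaces the warehouses with the most spillage over the past week, under the same connector and credential assumptions as above:

```python
# Sketch: find warehouses whose queries spill to local or remote storage,
# a sign the warehouse may be undersized for those workloads.
import os

import snowflake.connector

SQL = """
SELECT
    warehouse_name,
    COUNT_IF(bytes_spilled_to_local_storage > 0
             OR bytes_spilled_to_remote_storage > 0) AS spilling_queries,
    SUM(bytes_spilled_to_local_storage) / POWER(1024, 3) AS local_spill_gb,
    SUM(bytes_spilled_to_remote_storage) / POWER(1024, 3) AS remote_spill_gb
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND warehouse_name IS NOT NULL
GROUP BY warehouse_name
ORDER BY remote_spill_gb DESC
"""

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
try:
    for row in conn.cursor().execute(SQL):
        print(row)
finally:
    conn.close()
```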
3. Query load
This metric helps organizations understand the time required for a given query to run in a data warehouse. If you decide to scale up the warehouse to accommodate the load, there is a tradeoff: query times decrease while costs go up.
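One way to approximate query load on Snowflake is the ACCOUNT_USAGE.WAREHOUSE_LOAD_HISTORY view, which reports average running and queued query load per warehouse. A sketch, with the same connection assumptions as the earlier examples:

```python
# Sketch: average running and queued query load per warehouse over the
# past week, to see whether a warehouse is consistently busy or mostly idle.
import os

import snowflake.connector

SQL = """
SELECT
    warehouse_name,
    AVG(avg_running) AS avg_running_queries,
    AVG(avg_queued_load) AS avg_queued_queries
FROM snowflake.account_usage.warehouse_load_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY avg_queued_queries DESC
"""

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
try:
    for wh, running, queued in conn.cursor().execute(SQL):
        print(f"{wh}: running={running:.2f} queued={queued:.2f}")
finally:
    conn.close()
```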
4. Queued queries
Reviewing the total number of queries that entered a queued state, usually due to load on the warehouse, will help you determine the minimum and maximum cluster settings that are ideal for scaling.
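On Snowflake, queries that queued because the warehouse was busy record a nonzero QUEUED_OVERLOAD_TIME in QUERY_HISTORY. A sketch that counts them per warehouse over the past week, again assuming the Python connector and environment-variable credentials:

```python
# Sketch: queries that spent time queued due to warehouse overload. A
# persistently high count suggests revisiting MAX_CLUSTER_COUNT or the size.
import os

import snowflake.connector

SQL = """
SELECT
    warehouse_name,
    COUNT_IF(queued_overload_time > 0) AS queued_queries,
    COUNT(*) AS total_queries,
    AVG(queued_overload_time) / 1000 AS avg_queued_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND warehouse_name IS NOT NULL
GROUP BY warehouse_name
ORDER BY queued_queries DESC
"""

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
try:
    for row in conn.cursor().execute(SQL):
        print(row)
finally:
    conn.close()
```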
Balancing warehouse cost and performance with Slingshot
There are many factors to consider when right-sizing your warehouses and managing these decisions effectively takes time. We built Capital One Slingshot to help businesses optimize their Snowflake investments by finding the right balance between warehouse cost and performance.
Warehouse provisioning
Slingshot’s warehouse scheduling tool allows users to set dynamic schedules using templates to automatically provision warehouses. The ability to tune warehouse sizes and scaling policies, apply schedules and manage inactivity controls to right-size warehouses based on need means you pay only for the compute you need, when you need it.
Warehouse recommendations
Slingshot’s warehouse recommendations identify when a warehouse is under- or overutilized and provide optimized size and cluster settings on a schedule. Where applicable, the recommendation also includes an estimated cost savings to attach tangible business value to the change. For example, Slingshot may recommend a schedule for a warehouse that estimates a 20.5% per month savings with just a 15-second increase in query execution time.
Optimizing your data warehouse for cost efficiency and scalability
The near-infinite scalability of data in the cloud brings incredible opportunities for new data insights and use cases, but managing all that data can present challenges without proper oversight. Following best practices and leveraging tools to proactively manage and right-size warehouses will help you optimize costs, improve efficiency and scale faster.