5 ways to improve data efficiency on data platforms
Best practices to optimize your data platforms while maximizing business value.
Gone are the days when IT costs were fixed and accounted for ahead of time. As more companies move their workloads to the cloud and develop cloud-native initiatives, they face the challenge of managing costs in a fluid and changing environment.
Companies are focused not only on moving to the cloud, but also on how to operate their data efficiently once they get there. More than 50% of cloud decision-makers are hiring new staff or training existing employees to optimize cloud spending, according to a 2022 Forrester study. And realizing tangible value from the cloud remains a challenge: more than 50% of business executives have not realized benefits such as cutting costs and driving new revenue, according to PwC's 2023 Cloud Business Survey.
At the 2023 Snowflake Summit this past June, we at Capital One had the opportunity to share how we optimized our data usage and removed inefficiencies in the cloud.
The importance of data efficiency on data platforms
Data efficiency refers to optimizing the balance between cost, speed, innovation and governance when operating in the cloud. Unlike the fixed costs of on-premises technology stacks, cloud costs fluctuate and adjust based on need. Businesses must optimize these costs by balancing scaling workloads and the speed of innovation with the appropriate governance and controls.
This is a tough balancing act. If you go at the speed of your business, you risk increasing costs. But if you put too much weight into controlling costs with heavy-handed governance, you may lose the ability to stay on top of innovation.
Because finding the right formula of speed, efficiency, innovation and governance is difficult, many companies still choose to manage their databases with an on-premises mentality. For example, a company on the cloud may still run all database matters through a centralized team, which can create bottlenecks for the organization.
How we improved data efficiency at Capital One
At Capital One, we take a balanced approach in order to unleash value from data at the speed required by the business while improving efficiency.
First, this involves determining if a use case is the right fit for a data platform by assessing the cost implications beyond storage, including the costs of loading data, computing power, onboarding users and query types. Evaluating the potential value of the use case also helps decide whether the data platform is the right fit.
Next, the focus should shift to removing waste and inefficiencies from your data platform. At Capital One, our efforts to remove waste and inefficiencies led to significant savings in multiple areas, including lower environments, application optimization, compute costs and query optimization.
Through our own experiences, we identified 5 focus areas for companies to improve data efficiency:
- Warehouse optimization
- Query optimization
- Dataset optimization
- Application optimization
- Environment optimization
Let’s walk through how we can address each of these five inefficiencies to ensure optimal use of data platforms and help you get more business value from your data.
1. Warehouse optimization
In many cases, one of the most expensive areas of cloud computing for a business will be the compute cost of running queries in a data warehouse. Warehouse optimization matters for every organization, and we found success with the following strategies:
One warehouse size does not fit all
You don't need the same warehouse size running the entire time. Warehouses should be scheduled to change size based on the query load at a given time of day or day of the week. A third-party tool or Snowflake's own scheduling features will allow you to resize your warehouses dynamically.
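As a minimal sketch of the scheduling approach, a pair of Snowflake tasks can resize a warehouse on a cron schedule; the warehouse and task names here are illustrative, and your scheduling tool or permissions model may differ.

```sql
-- Scale a (hypothetical) reporting warehouse up for the weekday batch window
-- and back down in the evening. Tasks are created suspended, so resume them.
CREATE TASK IF NOT EXISTS scale_up_reporting_wh
  WAREHOUSE = admin_wh
  SCHEDULE  = 'USING CRON 0 6 * * MON-FRI America/New_York'
AS
  ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';

CREATE TASK IF NOT EXISTS scale_down_reporting_wh
  WAREHOUSE = admin_wh
  SCHEDULE  = 'USING CRON 0 19 * * MON-FRI America/New_York'
AS
  ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XSMALL';

ALTER TASK scale_up_reporting_wh RESUME;
ALTER TASK scale_down_reporting_wh RESUME;
```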
Set auto-suspend to prevent idleness
Nobody wants to spend money on a warehouse that is sitting idle. Setting a shorter auto-suspend time (e.g., reducing it from 15 minutes to 2 minutes) will decrease idleness and save on costs. There is a tradeoff: you lose the warehouse cache. However, we have found that after an initial dip in performance on the first queries, the warehouse returns to running at optimal levels.
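Tightening auto-suspend is a one-line change; the warehouse name below is a placeholder, and AUTO_SUSPEND is specified in seconds.

```sql
-- Suspend after 2 minutes of inactivity instead of 15, per the example above.
ALTER WAREHOUSE etl_wh SET AUTO_SUSPEND = 120;

-- Make sure queued queries can wake the warehouse back up on their own.
ALTER WAREHOUSE etl_wh SET AUTO_RESUME = TRUE;
```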
Configure multi-cluster with the right scaling policy
In Snowflake, you can configure the multi-cluster feature to kick in when concurrency increases. The options are standard, in which a second cluster dynamically activates once the current clusters are saturated by the workload or system load, or economy, which avoids spinning up new clusters in order to save credits. If your application can afford to wait through more than five minutes of high load before a second cluster spins up, the economy setting can yield significant savings, since new clusters are added far more conservatively.
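Here is a sketch of a multi-cluster warehouse configured with the economy policy; the name, size and cluster limits are illustrative, and multi-cluster warehouses require an edition that supports them.

```sql
CREATE WAREHOUSE IF NOT EXISTS app_wh
  WAREHOUSE_SIZE    = 'SMALL'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY    = 'ECONOMY'  -- favor queueing over eagerly adding clusters
  AUTO_SUSPEND      = 120        -- seconds
  AUTO_RESUME       = TRUE;
```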
Keep different types and sizes of workloads separate
Each team has its own requirements and should right-size its warehouses to match those requirements. This is where striking the right balance between performance and cost is crucial. Keeping small queries and large queries in separate warehouses will help avoid unnecessary costs. Also, keeping warehouses suspended for as long as possible will be an effective way to save costs.
2. Query optimization
Queries are fundamental to every data-driven organization today, but they can be costly, and there is a large difference in cost between large and small queries. Focusing on query optimization can save your company a lot of money.
Some of the key ways to start optimizing your queries include the following:
Alerting mechanisms
Alerts give you notice when a query runs longer than intended or at a time when it should not be running. You don't want to learn the cost of a query for the first time from the monthly bill. Some queries may run for days and then time out; the user, unaware of the problem, may run the query repeatedly in hopes of success. An alert provides the necessary feedback loop to take appropriate action rather than repeating errors unknowingly.
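Two simple building blocks for that feedback loop are a statement timeout on the warehouse and a monitoring query over Snowflake's account usage views that your alerting tool can poll; the warehouse name and thresholds here are illustrative.

```sql
-- Guardrail: cancel anything on this (hypothetical) ad hoc warehouse that
-- runs longer than one hour.
ALTER WAREHOUSE adhoc_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;

-- Surface yesterday's longest-running queries so owners hear about them
-- before the monthly bill arrives.
SELECT user_name,
       warehouse_name,
       query_id,
       total_elapsed_time / 1000 AS elapsed_seconds,
       execution_status
FROM   snowflake.account_usage.query_history
WHERE  start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
  AND  total_elapsed_time > 15 * 60 * 1000   -- more than 15 minutes
ORDER BY total_elapsed_time DESC;
```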
Education on queries
At Capital One, we also learned the importance of educating users on queries that negatively impact performance. Integrating best practices into your systems, or finding ways to identify these queries before they run on Snowflake, is instrumental to saving money up front. A few of the inefficient patterns we have seen include single-row inserts, SELECT * against large tables, Cartesian product joins, deeply nested views and queries that spill to disk.
Educating users on these particular queries can help avoid unwanted repeat behaviors. To equip users for success, there should be a feedback loop in place so users can be aware of the inefficient query and fix it. The hope is there will be a domino effect in your organization with the user passing on knowledge to their teams, which should lead to greater efficiency in running queries across the company.
Reduce data scanning
Another important way to optimize your queries is to reduce the amount of data scanned. In Snowflake, the more bytes a query reads, the longer it takes and the more it costs. To improve query performance and reduce bytes scanned, make sure your queries use appropriate filters so Snowflake can narrow down the data it processes. You may also want to avoid certain types of joins or break a large query into multiple pieces.
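As an illustration (the table and column names are made up), the difference between scanning everything and letting Snowflake prune looks like this:

```sql
-- Reads every column and, without a selective filter, potentially every
-- micro-partition of a large table.
SELECT * FROM transactions;

-- Projects only the needed columns and filters on a date range so Snowflake
-- can prune micro-partitions that fall outside it.
SELECT account_id,
       transaction_date,
       amount
FROM   transactions
WHERE  transaction_date >= '2023-06-01'
  AND  transaction_date <  '2023-07-01';
```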
Take notice of spillage
Additionally, data stakeholders should pay attention to queries that trigger spillage because the warehouse is undersized. Spillage indicates either a bad query that should be rewritten or a workload that genuinely needs a bigger warehouse. In the latter case, continuing to run the query on the smaller warehouse will cost the organization more than moving it to an appropriately sized one.
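A query like the following, sketched against the account usage views, can flag spilling queries for review; the lookback window and ordering are arbitrary choices.

```sql
-- Queries that spilled to local or remote storage in the last 7 days:
-- candidates for a rewrite or, if the pattern is legitimate, a larger warehouse.
SELECT query_id,
       warehouse_name,
       warehouse_size,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM   snowflake.account_usage.query_history
WHERE  start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND  (bytes_spilled_to_local_storage > 0
        OR bytes_spilled_to_remote_storage > 0)
ORDER BY bytes_spilled_to_remote_storage DESC;
```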
3. Dataset optimization
The longstanding practice of optimizing datasets, or improving a dataset's performance and usability, is just as relevant in the cloud as it was on premises. What the cloud adds is dataset optimization's direct impact on cost.
How data is loaded matters
Most of the time when we talk about data costs, we think about storage. But how you load the data and the frequency with which you load datasets greatly impact your costs. One of the lessons we learned is to avoid loading datasets in near real time when users access them infrequently, such as only once a month. We also saved money by making adjustments on loading, such as breaking files into small chunks before loading.
For smaller datasets, the most efficient way to load data into Snowflake is usually Snowpipe, Snowflake's continuous data ingestion service, rather than a dedicated warehouse running copy jobs. Once data is loaded, keep transformations to a minimum to avoid expensive updates and merges. There is also no need to run every load through a single XL warehouse; instead, right-size the warehouse for each load job.
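A minimal Snowpipe sketch, assuming an external stage with cloud event notifications already configured; the stage, table and pipe names are placeholders.

```sql
-- Files landing in the stage are loaded by Snowpipe's managed compute
-- instead of a dedicated warehouse running scheduled COPY jobs.
CREATE PIPE IF NOT EXISTS raw_events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_events
  FROM @raw_events_stage
  FILE_FORMAT = (TYPE = 'JSON');
```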
Keep risks and costs top of mind
One of the biggest draws of cloud computing is storage, specifically the cost effectiveness of storing data in the cloud instead of maintaining physical infrastructure. However, many organizations hold petabytes of data in the cloud, making storage one of the biggest risks for quickly escalating costs. A recent survey found that 23% of IT leaders believe cloud storage accounts for more than half of their cloud costs, while 69% of respondents said storage takes up more than a quarter of their total cloud costs. Approaching data storage with risks and costs in mind can help improve data efficiency:
- Establish a business-driven retention policy: A data retention policy for each production dataset will help your organization define how long data should be retained based on business needs and regulations.
- Restrict data retention and time travel in sandboxes: We recommend disabling time travel on staging or sandbox tables and enforcing an aggressive retention policy, such as time limits of a few weeks to months, in sandbox environments.
Avoid large time travel
For tables that are particularly volatile and change frequently, avoid long time travel retention periods, which require significant storage.
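The retention and time travel guidance above maps to a couple of table-level settings; the database, schema and table names in this sketch are hypothetical.

```sql
-- Sandbox/staging work: transient tables have no fail-safe period, and
-- DATA_RETENTION_TIME_IN_DAYS = 0 disables time travel entirely.
CREATE TRANSIENT TABLE sandbox_db.scratch.tmp_model_scores (
  account_id NUMBER,
  score      FLOAT
)
DATA_RETENTION_TIME_IN_DAYS = 0;

-- Volatile production tables: keep time travel short rather than the maximum.
ALTER TABLE prod_db.core.session_events
  SET DATA_RETENTION_TIME_IN_DAYS = 1;
```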
Choose the right data model
Your data model, which serves as a blueprint for organizing and structuring data in Snowflake, grows in importance as datasets become more complex and directly affects performance. The tables you create should be aligned to your company's consumption patterns. In some cases, for example, denormalized tables that store related data together may be the right approach for faster query execution.
Employ search optimization
Another way to improve dataset performance is Snowflake's search optimization service, which plays a role similar to an index. Although not ideal for all use cases, it is best suited to highly selective, needle-in-a-haystack lookups that return only a handful of rows. For those queries, performance will be much faster with search optimization.
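Enabling it is a single statement; the table, column and lookup value below are illustrative, and the service carries its own storage and maintenance costs.

```sql
-- Build a search access path for highly selective point lookups.
ALTER TABLE customer_events ADD SEARCH OPTIMIZATION;

-- The kind of needle-in-a-haystack predicate that benefits:
SELECT *
FROM   customer_events
WHERE  event_id = '8c1f2a70-1111-2222-3333-abcdef012345';
```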
Utilize auto clustering
Auto clustering can also enhance your datasets by keeping a table's micro-partitions organized around a clustering key, so queries prune more data and run faster. Because the background reclustering consumes credits, evaluate it on a case-by-case basis: understand your access patterns and decide when the feature is worth turning on or off.
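A sketch of what turning it on and checking the result might look like, using a hypothetical table and clustering key:

```sql
-- Define a clustering key; Snowflake's automatic clustering service then
-- keeps micro-partitions organized around it in the background.
ALTER TABLE transactions CLUSTER BY (transaction_date);

-- Inspect clustering health, and pause reclustering if it is not paying off.
SELECT SYSTEM$CLUSTERING_INFORMATION('transactions', '(transaction_date)');
ALTER TABLE transactions SUSPEND RECLUSTER;
```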
Take advantage of materialized views
Making use of materialized views can save on costs while optimizing your query performance. For the right use cases, these precomputed results of a query can provide faster query performance across queries using the same logic and increase data transparency.
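For instance (table and view names are made up, and this assumes an edition that supports materialized views), a materialized view can precompute an aggregate that many dashboards request with identical logic; Snowflake maintains the view automatically, which has its own cost, so reserve it for hot, repeated access patterns.

```sql
CREATE MATERIALIZED VIEW daily_spend_mv AS
SELECT account_id,
       transaction_date,
       SUM(amount) AS total_spend
FROM   transactions
GROUP BY account_id, transaction_date;
```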
Follow best practices for loading and storing data
The decisions you make on how you load data and where you store data, whether in Snowflake or outside of the platform, are crucial. Here are a few important best practices:
- Understand access patterns and match them to table load frequency: If you are loading data every five minutes and spending $30 for each load, but that table is only queried once a month, a lot of money is going to waste. Analyze access patterns and decide how frequently you should load the data, knowing that not everything needs to run in real time.
- Use an external table if access is limited: If, after analyzing your queries, you realize access is infrequent, the best solution may be to keep your data in your data lake and create an external table (see the sketch after this list).
- Choose between semi-structured and structured: Determine whether to store your data as semi-structured or structured, which can depend on whether you store data natively in Snowflake or break it into normalized tables.
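Here is a rough sketch of the external table option, assuming a stage named @lake_stage already points at the data lake location and the files are Parquet; all names are placeholders.

```sql
-- Expose rarely accessed lake data without loading it into Snowflake.
CREATE EXTERNAL TABLE IF NOT EXISTS archive_events
  LOCATION = @lake_stage/events/
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = FALSE;

-- External tables expose each row as a VARIANT column named VALUE.
SELECT value:"event_type"::STRING AS event_type,
       COUNT(*)                   AS events
FROM   archive_events
GROUP BY 1;
```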
4. Application optimization
Snowflake is not just a database. You can also build your cloud-native applications on the platform. We learned some best practices on how to do this efficiently.
Better task management can reduce costs
We found that task management was key to reducing costs when building an application. Application development creates many tasks, or scheduled jobs, in lower environments, so the first step is to identify and organize your account's tasks with tags. Next, define standards for task priorities. Also remember to suspend tasks during off hours, such as non-business hours and weekends, and resume them automatically during business hours. Encourage your developers to suspend tasks and run them on demand.
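A sketch of those mechanics, assuming an edition that supports object tagging and using made-up task and tag names:

```sql
-- Tag tasks so they can be grouped and governed by priority.
CREATE TAG IF NOT EXISTS task_priority;
ALTER TASK nightly_refresh SET TAG task_priority = 'low';

-- Suspend a low-priority task outside business hours and resume it later;
-- developers can still trigger a one-off run on demand.
ALTER TASK nightly_refresh SUSPEND;
EXECUTE TASK nightly_refresh;
ALTER TASK nightly_refresh RESUME;
```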
Be watchful of external data checks
In optimizing your applications, be mindful of when your system is processing row-level data and performing external checks. Merge commands, in which update, insert and delete commands can all be achieved simultaneously, are useful for data synchronization and updates. But they can also be costly. In a situation involving large datasets, for example, the process of comparing and matching data between source and target tables becomes time consuming.
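For reference, a typical MERGE looks like the sketch below (table and column names are illustrative); on large tables, the join between source and target is where the time and cost accrue, so keep the source delta as small and well filtered as possible.

```sql
MERGE INTO customer_dim AS tgt
USING daily_customer_updates AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
  tgt.email      = src.email,
  tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, email, updated_at)
  VALUES (src.customer_id, src.email, src.updated_at);
```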
Identify queries that typically increase cloud services charges
One important thing to remember about Snowflake is that cloud services activity, such as metadata operations, is typically not charged, but only up to a threshold: once the compute required to service these requests exceeds 10% of your daily warehouse compute, cloud services charges start to appear. Certain application development activities are more likely to generate cloud services usage that pushes you over that 10% threshold, leading to an unwanted jump in the monthly bill.
Identifying the activities that generate cloud services usage and push you past the threshold is crucial; a monitoring sketch follows the list below. These activities often include:
- COPY commands
- Extensive use of RESULT_SCAN
- Frequent DDL operations or cloning
- Running extensive SHOW commands
- A high frequency of simple queries
- Single-row inserts and fragmented schemas
- Extensive querying of INFORMATION_SCHEMA
- Queries with many joins or Cartesian products, huge SQL statements and large IN lists
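To keep an eye on where you stand against the 10% threshold, a query along these lines against the account usage metering view can be scheduled in your monitoring tooling; the 30-day window is arbitrary.

```sql
-- Daily ratio of cloud services credits to warehouse compute credits.
SELECT usage_date,
       SUM(credits_used_compute)        AS compute_credits,
       SUM(credits_used_cloud_services) AS cloud_services_credits,
       SUM(credits_used_cloud_services)
         / NULLIF(SUM(credits_used_compute), 0) AS cloud_services_ratio
FROM   snowflake.account_usage.metering_daily_history
WHERE  usage_date >= DATEADD('day', -30, CURRENT_DATE())
GROUP BY usage_date
ORDER BY usage_date;
```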
5. Environment optimization
At Capital One, we realized we needed to be vigilant about controlling costs according to the development environment. In lower environments, we formerly allowed the creation of any warehouse size, which ended up costing us. We now enforce strict policy governance in lower environments, including limiting warehouse sizes to extra small and small, allowing at most two clusters and offering only economy as the scaling policy. Today, we keep the costs we incur in lower environments squarely in sight.
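As a sketch of what such guardrails might look like in practice (names, quotas and limits are all illustrative, and multi-cluster settings and resource monitors depend on your edition and privileges):

```sql
-- A capped, economy-scaled warehouse for a lower environment.
CREATE WAREHOUSE IF NOT EXISTS dev_wh
  WAREHOUSE_SIZE    = 'XSMALL'
  MAX_CLUSTER_COUNT = 2
  SCALING_POLICY    = 'ECONOMY'
  AUTO_SUSPEND      = 60
  AUTO_RESUME       = TRUE;

-- A resource monitor that notifies at 80% of a monthly credit quota and
-- suspends the warehouse at 100%.
CREATE OR REPLACE RESOURCE MONITOR dev_monthly_cap
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE dev_wh SET RESOURCE_MONITOR = dev_monthly_cap;
```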
How we maximized data platform efficiency at Capital One
To reach a high level of efficiency in your data platforms day to day, your organization must maintain visibility into cloud usage and costs, receive alerts and notifications that surface insights where needed, and get recommendations for remediation. At Capital One, we obsessively track waste in our environment, make adjustments based on those insights and automate wherever possible. We also recognize that user behavior and workloads change constantly and therefore require continuous monitoring to avoid unnecessary expenses for the business.
In modernizing our data ecosystem to operate at scale in the cloud, we developed internal tools that helped us automate away and reduce many of our own inefficiencies. Those tools eventually became Capital One Slingshot, a data management SaaS solution we built so that we could focus on getting value from our data and moving at the speed of the business. Learn how Slingshot can help you maximize your Snowflake investment.