Snowflake query optimization guide

Snowflake is a cloud-based data warehousing and analytics platform that streamlines data operations and helps reduce the inefficiencies of cloud computing. But as data volumes grow and the number of queries needed to use that data increases, monitoring and optimizing query performance becomes essential.

Query performance directly affects costs because of Snowflake's pay-as-you-go pricing model, which makes query optimization a key part of controlling cloud spend. Knowing how to optimize queries in Snowflake reduces query costs, improves query performance and leads to more efficient use of resources.

Understanding Snowflake architecture

Snowflake's architecture separates storage and compute resources to allow independent scaling of both. This is helpful for enterprise-scale operations with complex cloud resourcing needs. 

Imagine a large enterprise is upgrading its financial management systems. These are hosted on the cloud, and several teams are working on different elements of the upgrade. The data storage team likely needs terabytes of space to migrate the old data into, while the MLOps team might need significantly more space when it's time to migrate a large training set. In an on-prem model, the enterprise might over-invest in a large data warehouse early on or run the risk of not having enough storage when the ML team is ready to move.

Snowflake's architecture allows each team to scale as needed, and it extends these advantages across the entire ecosystem. The team developing new AI models, for instance, may not need as much storage as the MLOps team, but it may be hungry for processing power. Because Snowflake separates compute from storage, different teams can grow their systems at their own pace without hitting bottlenecks caused by limited resources, and without paying for extra capacity that sits idle during a scale-up.

How Snowflake manages resources and queries

The internals of Snowflake can get complicated, but from the outside, the platform handles queries in a relatively straightforward way. When a query is submitted, Snowflake's cloud services layer, which acts like a triage system that allocates jobs as efficiently as possible, routes the request to the query optimizer. The optimizer analyzes the request and devises the most efficient execution plan available, meaning it works out the best way to retrieve and process the requested data.

With a plan in place, Snowflake allocates compute resources to execute the query. These compute clusters are virtual warehouses that scale horizontally and dynamically, depending on the immediate workload. This avoids both kinds of waste: bottleneck slowdowns from too few resources and resource glut caused by an overestimation of need. Because every virtual warehouse works against the same underlying storage, Snowflake enables parallel processing in a multi-cluster, shared-data architecture.
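
As a sketch of how that scaling is configured in practice, the statement below creates a multi-cluster warehouse (an Enterprise Edition feature) that adds clusters under concurrent load and suspends itself when idle. The warehouse name and sizing values are illustrative rather than prescriptive:

    CREATE WAREHOUSE analytics_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM'   -- per-cluster compute size
      MIN_CLUSTER_COUNT = 1            -- scale in to one cluster when quiet
      MAX_CLUSTER_COUNT = 4            -- scale out under concurrent load
      SCALING_POLICY = 'STANDARD'
      AUTO_SUSPEND = 60                -- suspend after 60 seconds of inactivity
      AUTO_RESUME = TRUE;              -- resume automatically on the next query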

Queries are just one place where Snowflake saves resources. The platform is also good at managing memory and disk space. Snowflake stores data in a columnar format that emphasizes data compression and efficient storage. 

Auto-clustering

Snowflake also uses auto-clustering, which dynamically rearranges a table's micro-partitions so that their layout matches the filter conditions queries actually use. This keeps data organized in a way that best suits the customer's SQL query patterns and avoids full table scans. The key is ensuring that fewer micro-partitions need to be scanned to retrieve the required data.
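
As a minimal sketch, assuming a hypothetical orders table that is usually filtered by date and region, a clustering key can be defined and its health checked like this:

    -- Define a clustering key; the automatic clustering service then
    -- reorganizes micro-partitions in the background to match it.
    ALTER TABLE orders CLUSTER BY (order_date, region);

    -- Check how well the table is currently clustered on a column.
    SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');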

Identifying performance bottlenecks

As efficient as it is for managing data and processing demands in the cloud, Snowflake still has the potential to develop bottlenecks. This is especially true for nascent teams running queries on the platform. It can be helpful to know where the most common pinch points are and how a team might run into difficulties running Snowflake.

Data distribution and query execution plans

Snowflake distributes data across multiple nodes as part of its query execution plans. This brings great economies of scale and savings on bandwidth and data warehousing, but bottlenecks can still occur. Profiling queries and closely monitoring their execution helps identify these as they develop, which allows teams to develop workarounds that integrate with their workflows.
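
One lightweight way to inspect a plan before running a query is EXPLAIN, which returns the compiled plan, including an estimate of how many partitions the query will touch. The table and filter below are placeholders:

    EXPLAIN
    SELECT customer_id, SUM(amount)
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id;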

Techniques for identifying bottlenecks

Teams working on the Snowflake platform can develop and integrate feedback and monitoring systems that help keep bottlenecks under control and streamline production. Developers in the Snowflake community have identified several techniques for spotting slowdowns and problem areas, as well as preventing them from developing in the first place.

  • Resource allocation: Snowflake partitions data in what the program thinks is the most efficient way possible. It's a smart feature, but niche applications or specialized team needs could make the usual optimization plan less efficient. Teams can mitigate this by making changes like adding auto-clustering to the table. 

  • Monitoring tools: Advanced monitoring tools help teams track query performance in real time and avoid bottlenecks. We developed Slingshot, a tool designed specifically for the Snowflake platform, with this user need in mind. Users can set alerts for specific trouble signs to prevent the development of larger bottlenecks. These tools are also helpful for developing time-sensitive reports and aiding a larger analysis of how queries perform. Routine monitoring is one of the most powerful tools for query optimization in Snowflake.

  • Query history storage: Snowflake keeps a record of past queries to inform analysis and projection of future performance. Data collected in this way includes query execution times and resource allocation, including the anticipated and actual resources used for storage and computing tasks. Reading over these histories, including the optimizer's history of predicted demand on resources, helps identify bottlenecks and prevent serious issues from developing as the project advances.

  • Query profiling: Snowflake's advanced query profiling tools are powerful options for analysis, review and constructive development. Reviewing these insights highlights common pinch points and the types of processes most likely to introduce errors, security concerns, inefficiencies and costly slowdowns. A sketch of how to pull query history and operator-level statistics follows this list.
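
To illustrate the last two points, the queries below pull the most expensive recent queries from the ACCOUNT_USAGE query history and then drill into operator-level statistics for one of them. The seven-day window is arbitrary, and the query ID is a placeholder:

    -- Recent queries that scanned the most data, with partition-pruning detail.
    SELECT query_id,
           query_text,
           warehouse_name,
           total_elapsed_time / 1000 AS elapsed_seconds,
           bytes_scanned,
           partitions_scanned,
           partitions_total
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    ORDER BY bytes_scanned DESC
    LIMIT 20;

    -- Operator-level profile for a specific query (substitute a real query ID).
    SELECT *
    FROM TABLE(GET_QUERY_OPERATOR_STATS('<query_id>'));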

Best practices for Snowflake query optimization

When using Snowflake, there are a few generally accepted best practices to optimize query performance and get better, faster results with low latency and efficient allocation of resources. 

Optimizing data loading

Data loading is a surprisingly complex process. Optimizing calls on data before the query runs can dramatically cut the time a query takes and reduce strain on the system architecture. There are several ways to streamline data loading:

  • Optimized file formats: Data is stored in different file formats, and these can have wildly different demands for data storage and transfer. Teams can use Parquet or ORC for staging before loading into Snowflake, which is a more efficient approach to storage and processing. This speeds up data loading and reduces the call on scarce resources.

  • Parallel loading: Snowflake's ability to load data in parallel tracks is a gift to teams trying to optimize their data loads. It's possible to divide large and unwieldy datasets into smaller files and run them concurrently to get an overall faster processing speed out of affected queries.

  • Snowpipe: Snowpipe is Snowflake's continuous data ingestion service. It lets developers load data onto the platform as soon as files become available, rather than letting everything accumulate and loading it as a single, unwieldy batch.

  • Efficient data compression: Compressing data prior to transmission is an old trick, but it's a good one. Efficient compression can reduce the size of a data package by 50% or more. Snowflake supports a large number of compression formats, including GZIP, ZSTD and LZ4.

  • Bulk loading: Snowflake supports bulk loading, which helps move large volumes of data quickly. A team can get a lot out of this for initial data loads, when large datasets need to be transferred in a single delivery. The COPY INTO command is especially helpful here, particularly when loading data from an external source (see the sketch after this list).
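
Putting several of these ideas together, the sketch below stages Parquet files in cloud storage, bulk-loads them with COPY INTO and then sets up a Snowpipe for continuous loading of new files. The stage, integration, table and pipe names are hypothetical, and AUTO_INGEST also requires event notifications to be configured on the bucket:

    -- Reusable file format for Parquet files.
    CREATE FILE FORMAT parquet_fmt TYPE = PARQUET;

    -- External stage pointing at the bucket where files land
    -- (assumes a storage integration named s3_int already exists).
    CREATE STAGE raw_sales_stage
      URL = 's3://example-bucket/sales/'
      STORAGE_INTEGRATION = s3_int;

    -- One-time bulk load of everything currently in the stage.
    COPY INTO sales
    FROM @raw_sales_stage
    FILE_FORMAT = (FORMAT_NAME = 'parquet_fmt')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

    -- Continuous loading: new files are ingested as they arrive.
    CREATE PIPE sales_pipe AUTO_INGEST = TRUE AS
      COPY INTO sales
      FROM @raw_sales_stage
      FILE_FORMAT = (FORMAT_NAME = 'parquet_fmt')
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;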

Utilizing proper data types

Choosing the right data types speeds up search and retrieval. Storing values in their natural types, rather than as generic strings, lets Snowflake compress, compare and prune data more effectively, which reduces the amount of data that has to be scanned and transferred and, in turn, lowers bandwidth demands and egress fees.
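
As a small illustration with a hypothetical events table, storing values in their natural types rather than as generic strings lets Snowflake compress and prune them far more effectively:

    -- Less efficient: everything stored as text.
    CREATE TABLE events_raw (
      event_id   VARCHAR,
      event_time VARCHAR,   -- dates stored as text defeat pruning on time ranges
      amount     VARCHAR
    );

    -- More efficient: native types compress better and support range pruning.
    CREATE TABLE events (
      event_id   NUMBER,
      event_time TIMESTAMP_NTZ,
      amount     NUMBER(12,2)
    );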

Using clustering keys

Clustering keys define the physical order of data in a table, which can improve query performance by reducing the number of disk reads required. Disk reads are a time-consuming process, and the fewer of those the system performs, the less latency creeps into the process.
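
Clustering keys can also be declared when a table is created. The sketch below, with hypothetical names, clusters a large fact table on the column most queries filter by, so that a date-bounded query only reads the micro-partitions it needs:

    CREATE TABLE page_views (
      view_time TIMESTAMP_NTZ,
      page_id   NUMBER,
      user_id   NUMBER
    )
    CLUSTER BY (view_time);

    -- This filter can prune most micro-partitions because the data is
    -- physically ordered by view_time.
    SELECT COUNT(*)
    FROM page_views
    WHERE view_time >= '2024-06-01' AND view_time < '2024-06-08';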

Avoiding unnecessary joins

Minimizing the number of joins in queries can improve performance, especially when joining large tables. A table with billions of rows, for example, might take up 500 gigabytes or more. Joining datasets of that size hoards resources and can create bottlenecks as the platform struggles to keep up.
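
One common pattern is a join that exists only to filter rows. Rewriting it as a semi-join with EXISTS makes the intent explicit and, depending on the plan, can avoid carrying the joined columns through the query. The tables here are hypothetical:

    -- Join used purely as a filter.
    SELECT o.order_id, o.amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    WHERE c.region = 'EMEA';

    -- Semi-join alternative: only checks that a matching row exists.
    SELECT o.order_id, o.amount
    FROM orders o
    WHERE EXISTS (
      SELECT 1
      FROM customers c
      WHERE c.customer_id = o.customer_id
        AND c.region = 'EMEA'
    );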

Limiting function use

Every function a team runs consumes some resources, even if the bandwidth and processing cost of each individual call is minimal. Using too many functions in a query can slow performance, and this is especially likely for teams working with large datasets, such as an ML training set. Teams working in Snowflake should generally use functions sparingly and only where the situation genuinely calls for them.
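
Functions wrapped around filtered columns are a frequent culprit, because they can prevent Snowflake from pruning micro-partitions. A small sketch with a hypothetical orders table:

    -- Function applied to the column: every row must be evaluated.
    SELECT COUNT(*)
    FROM orders
    WHERE TO_CHAR(order_date, 'YYYY-MM') = '2024-06';

    -- Plain range predicate on the raw column: partition pruning can apply.
    SELECT COUNT(*)
    FROM orders
    WHERE order_date >= '2024-06-01'
      AND order_date <  '2024-07-01';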

Snowflake's bulk loading capabilities

Bulk loading is one of the most useful ways to optimize query performance in Snowflake, and the platform offers several tools for doing it. Automatic clustering has already been covered, as have optimized storage and the COPY INTO command. There are still more native capabilities Snowflake teams can use.

Staging data before loading

Data staging is an effective technique for handling large amounts of data in transit or at an intermediate point in a data transaction. Staging data usually means storing it in a temporary data warehouse where it can be processed before moving on to its final destination. 

Users can create a dedicated stage in Snowflake for data before it gets loaded into tables. This keeps staging data separated from production data and protects data integrity.
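
As a minimal sketch using hypothetical names, a named internal stage can hold files before they are loaded, and the PUT command (run from SnowSQL or another client) uploads and compresses local files into it:

    -- Dedicated internal stage that keeps incoming files separate from tables.
    CREATE STAGE customer_staging;

    -- Upload local files into the stage; AUTO_COMPRESS gzips them in transit.
    PUT file:///data/customers_2024_06.csv @customer_staging AUTO_COMPRESS = TRUE;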

Validating data

Staging is a good point to validate and cleanse data before it is loaded into Snowflake tables. This preliminary step cuts down on extraneous data moving through the pipeline and reduces the overall load on the system.
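
COPY INTO can also be run in validation mode first, which checks the staged files for errors without loading anything. The table name and CSV settings below are hypothetical and follow on from the staging sketch above:

    -- Dry run: report parsing errors in the staged files, load nothing.
    COPY INTO customers
    FROM @customer_staging
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    VALIDATION_MODE = RETURN_ERRORS;

    -- Actual load, skipping any rows that still fail to parse.
    COPY INTO customers
    FROM @customer_staging
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    ON_ERROR = CONTINUE;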

Bulk partitioning

Ideally, data in Snowflake should be partitioned along the columns that queries filter on most frequently. Organizing micro-partitions around these frequently used columns keeps related information together in an easily accessed arrangement, which makes searches simple by reducing the overall amount of data that needs to be scanned.

Advanced query optimization techniques and tips

Beyond these basic techniques for efficient processing and storage of data, Snowflake supports a range of more sophisticated approaches to query optimization. A couple of these techniques are unique to the Snowflake platform, and they're also available through solutions like Slingshot.

Materialized views and caching

Materialized views help Snowflake users improve query performance by reducing processing overhead with precomputed aggregations for queries that use the same subquery results repeatedly. They are automatically maintained by Snowflake and are updated by a background service when changes are made to the base table.

This is similar to caching, which is the temporary storage of frequently accessed data. Large search engines have used caches of common and popular searches for years, and Snowflake caching works in a similar way to permit local storage of the most in-demand data.
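
As a sketch of the materialized view side (the feature requires Enterprise Edition, and the table and view names here are hypothetical), a frequently repeated single-table aggregation can be precomputed once:

    -- Precompute a daily revenue rollup that dashboards query repeatedly.
    CREATE MATERIALIZED VIEW daily_revenue AS
    SELECT order_date,
           region,
           SUM(amount) AS total_amount,
           COUNT(*)    AS order_count
    FROM orders
    GROUP BY order_date, region;

    -- Queries against the view read the precomputed results; Snowflake keeps
    -- them current in the background as the base table changes.
    SELECT region, total_amount
    FROM daily_revenue
    WHERE order_date = '2024-06-01';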

Monitoring and ongoing optimization

Monitoring and ongoing improvement are parts of every successful data migration and management plan. AI teams are familiar with iterative design and continuous optimization. Snowflake's analysis tools come in handy by providing a granular information breakdown that's full of insights and suggested areas of improvement. Capital One Slingshot can help by providing enhanced visibility into cost, performance and usage of Snowflake.
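
On the cost side, the ACCOUNT_USAGE schema also exposes warehouse credit consumption, which makes it straightforward to track spend per warehouse over time. A simple illustrative query:

    -- Credits consumed per warehouse over the last 30 days.
    SELECT warehouse_name,
           SUM(credits_used) AS credits_last_30_days
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY credits_last_30_days DESC;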

Optimize Snowflake queries with Capital One Slingshot

Optimizing your queries is one of the most cost-effective ways to improve the efficiency of cloud services, especially for users with large data demands. Snowflake is one of the most helpful tools for doing this, and Capital One Slingshot's Query Advisor builds on that by identifying inefficient Snowflake query patterns and recommending improvements.

Get in touch with our team for a personalized demo and to learn more about how Slingshot helps optimize cost and performance in Snowflake.
