Driving cloud value through cost optimization & efficiency

Learn how to optimize cloud infrastructure, reduce costs and improve efficiency with best practices for cloud cost optimization.

Presented at AWS re:Invent 2024 by Jerzy Grzywinski, Senior Director of Software Engineering, and Brent Segner, Distinguished Engineer, at Capital One.

In the last decade, we’ve seen the conversation shift from ‘moving to the cloud’ to ‘thriving in the cloud.’ With federated access to near-infinite resources, the cloud offers a great opportunity to fuel innovation, but it can also act as an anchor if it is not well managed. Optimizing costs and ensuring efficient cloud usage are at the heart of maximizing the value of operating in the cloud.

At AWS re:Invent 2024, we presented how Capital One has evolved its FinOps framework over the last five years, leaning heavily into driving efficiency at scale to monitor, benchmark and optimize compute usage across our cloud environment. Our approach to cloud cost management ensures that teams can maximize efficiency, reduce waste and allocate cloud resources effectively while maintaining performance.

Cloud optimization best practices: a multilayered approach

When we think about positive cloud cost optimization outcomes, it’s important to have an understanding of the full stack. The cloud infrastructure layer is most commonly correlated with efficiencies, but the effort shouldn’t stop there. At Capital One, we have expanded our focus into the software layer to better understand how software choices affect cloud infrastructure design decisions, as well as to identify areas of code optimization that can’t be solved with hardware.

Taking into account utilization, sustainability and performance, we’ve found that a multilayered strategy targeting both infrastructure and software is key to deriving maximum cost efficiency from our cloud investments.

Layer 1: rightsizing & optimization for cloud infrastructure

At the cloud infrastructure layer, our goal is to match the best-fit instance to the workload. To do this, it's important for us to have an apples-to-apples comparison across different cloud resources, instance types, sizes and generations. CoreMark is an open source benchmark that helps us accomplish this comparison. Standardizing on CoreMark allows our tooling to provide rightsizing recommendations that more closely align compute utilization requirements with instance allocation.
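
As a simplified illustration of how standardized benchmarking can feed rightsizing, the sketch below ranks candidate instance types by cost per unit of CoreMark once a workload's compute requirement is known. The instance names, scores and prices are placeholder values for illustration only, not our benchmark data or the actual tooling described above.

```python
# Illustrative sketch only: instance names, CoreMark scores and hourly prices
# are placeholder values, not Capital One benchmark data.
from dataclasses import dataclass


@dataclass
class InstanceProfile:
    name: str
    coremark: float      # benchmarked CoreMark score for the instance
    hourly_price: float  # on-demand price in USD per hour (illustrative)


CANDIDATES = [
    InstanceProfile("m5.xlarge", coremark=95_000, hourly_price=0.192),
    InstanceProfile("m6i.xlarge", coremark=115_000, hourly_price=0.192),
    InstanceProfile("m6i.large", coremark=58_000, hourly_price=0.096),
]


def rank_by_price_performance(candidates, required_coremark):
    """Keep instances that meet the workload's compute requirement,
    then rank them by cost per unit of CoreMark (lower is better)."""
    eligible = [c for c in candidates if c.coremark >= required_coremark]
    return sorted(eligible, key=lambda c: c.hourly_price / c.coremark)


for profile in rank_by_price_performance(CANDIDATES, required_coremark=50_000):
    cost_per_million = profile.hourly_price / profile.coremark * 1_000_000
    print(f"{profile.name}: ${cost_per_million:.2f} per hour per million CoreMark")
```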

Some lessons that we have learned through our experience include:

  • Scaling efficiently: One key insight we gained through this process is that bigger isn’t always better in terms of EC2 sizing. Scaling vertically with larger instances can sometimes backfire, leading to performance problems due to physical hardware limitations. Instead, we set a best practice of scaling horizontally with smaller instances, which often results in better performance, resilience and cost efficiency.

  • Generational gains: Even though newer generations of EC2 instances often come with a higher per-hour cost, they also bring significant performance improvements. Our tools give developers the data they need to decide when to upgrade to a newer instance generation while simultaneously reducing instance size, achieving both cost savings and performance improvements. This gives them a way to use the latest EC2 technology without overspending.

  • Accurate instance selection: It’s important to note that not all instances handle certain operations the same way. With CoreMark, we’re able to make sure we’re selecting not just the right size of instance, but also the right type. Getting instance selection right brings not only operational and financial benefits, but also a sustainability benefit: an instance that delivers the right value, speed and performance can potentially be more energy efficient. This is additional context we’re layering into our tooling and our interactions with developers.

Layer 2: enhancing cloud efficiency through software optimization

When we look at the software layer, there are two areas that we focus on: 

  • Software decisions that have a strong correlation with infrastructure decisions.

  • Software decisions that impact efficiency no matter what infrastructure decisions are made.

To understand what we mean when we talk about software decisions that have a direct impact on our infrastructure decisions, let’s look at a real-life example. Expecting cost savings, one of our teams decided to migrate a processing job to Lambda. However, costs actually rose. After investigating, the team determined that the cost impact was driven by the Python-based Pandas library, which limited processing to a single thread. Allocating more vCPUs to the Lambda function had no impact on run time or execution cost.

The team transitioned to the Rust-based Polars library to allow for multi-threading, which resulted in a 50% performance improvement. With faster processing, Lambda run time dropped, resulting in a 2x reduction in cloud spend.
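
To make the shape of that change concrete, here is a hypothetical sketch of the before and after. The actual job, schema and file layout aren’t described above, so the column names and Parquet format here are assumptions.

```python
# Hypothetical sketch of the migration described above; the real job's schema,
# data format and logic are assumptions for illustration.
import pandas as pd
import polars as pl


def aggregate_with_pandas(path: str) -> pd.DataFrame:
    # Pandas runs this group-by on a single thread, so allocating more vCPUs
    # to the Lambda function does not shorten the run.
    df = pd.read_parquet(path)
    return df.groupby("account_id", as_index=False)["amount"].sum()


def aggregate_with_polars(path: str) -> pl.DataFrame:
    # Polars builds a lazy query plan and executes the scan and group-by
    # across all available cores, so the same work finishes sooner.
    return (
        pl.scan_parquet(path)
        .group_by("account_id")
        .agg(pl.col("amount").sum())
        .collect()
    )
```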

Examples of our second focus area – software decisions that impact efficiency no matter what infrastructure decisions are made – include running unnecessary, duplicative computations or making unnecessary API calls. Dispensable, poorly written software that drives computation makes the hardware look busy without actually providing business value. This type of software optimization requires mapping waste and value to each line of code, independently of what the infrastructure metrics might indicate.
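
As a minimal illustration of this second category, the sketch below caches a repeated lookup so a batch job stops issuing duplicate API calls for the same input. The client, endpoint and data shapes are hypothetical stand-ins, not a real service.

```python
# Minimal sketch of removing duplicate work; the endpoint and payloads below
# are hypothetical stand-ins, not a real service.
from functools import lru_cache
import urllib.request


@lru_cache(maxsize=1024)
def fetch_exchange_rate(currency: str) -> str:
    # Without the cache, every transaction in the batch triggers its own HTTP
    # call for the same currency, keeping the hardware busy without adding
    # business value.
    url = f"https://rates.example.com/v1/{currency}"  # hypothetical endpoint
    with urllib.request.urlopen(url) as response:
        return response.read().decode()


def price_transactions(transactions):
    # Repeated currencies now hit the in-process cache instead of the network.
    return [
        (txn["id"], txn["amount"], fetch_exchange_rate(txn["currency"]))
        for txn in transactions
    ]
```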

GPU cost optimization for AI, ML and cloud efficiency

With the growing emphasis on AI and ML, optimizing GPU cost and performance has become a key priority. While much of the discussion around infrastructure and software optimization has traditionally focused on CPU-based applications, the same foundational principles apply to GPU efficiency. However, conventional metrics like CPU and memory utilization do not provide a complete picture of GPU workload efficiency, necessitating the use of additional insights.

Metrics such as GPU power consumption and thermal measurements offer deeper visibility into potential inefficiencies, helping to identify misconfigured hardware, improper instance sizing or suboptimal model performance.
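
As one way to surface those signals, the sketch below samples power draw, temperature and utilization through NVIDIA's NVML Python bindings. It is an illustrative collector, assuming an NVIDIA GPU and driver are present, not the telemetry pipeline we run in production.

```python
# Illustrative GPU telemetry sampler using NVIDIA's NVML bindings
# (pip install nvidia-ml-py); assumes an NVIDIA GPU and driver are present.
import pynvml


def sample_gpu_telemetry():
    pynvml.nvmlInit()
    try:
        samples = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            samples.append({
                "gpu": index,
                # NVML reports power in milliwatts.
                "power_watts": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,
                "temperature_c": pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU
                ),
                "gpu_util_pct": util.gpu,
                "memory_util_pct": util.memory,
            })
        return samples
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    for sample in sample_gpu_telemetry():
        print(sample)
```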

By leveraging a broad spectrum of telemetry data and fostering close collaboration with AI platform owners and data scientists, we can fine-tune both the infrastructure and the AI models, ensuring maximum performance, efficiency and cost-effectiveness.

Beyond the tools: building a culture of cloud efficiency

Having advanced tooling and knowledge of real-time performance optimization is only half of the equation. We understand that technology alone isn't enough. Establishing the right culture, one that encourages efficient engineering through accountability, clarity and incentives, is key. That means establishing measurements that technologists are aware of, understand, are held accountable for and are celebrated for acting on. We want to provide our developers with automated solutions that make it easier for them to do the right thing.

Our secret power is the culture of optimization we’ve fostered among our technologists. You need a combination of people, processes and technology to succeed in the modern cloud landscape. That’s why we continuously strive to empower our people, encourage continuous improvement and create an environment where everyone is invested in cloud efficiency.

Bringing it all together: a strategic approach to cloud cost efficiency

At Capital One, our vision for cloud optimization centers on a multilayered approach, advanced tooling and a focus on developer empowerment, which together offer a blueprint for any organization looking to maximize its cloud investments.

Our own cloud optimization journey is an ongoing process of learning and innovation. Based on our experience, we encourage those on their own cloud optimization journey to prioritize the following principles:

  • Understand the needs of your users and your business: Understand your user base and figure out what you can do to serve their needs rather than your own interests. Develop a deep understanding of what's actually being done within the application itself and how you can better address those needs. 

  • Develop a strategy and use supporting tools to execute: Take key learnings and observations back and incorporate them into your tooling. Turn those observations into visualizations so teams can gain insights.

  • Measure what matters most: If you try to measure everything, you end up measuring nothing. Determine the key things that will have the most value for the development teams and surface those for measurement. 

Learn more about Capital One Tech and explore career opportunities

New to tech at Capital One? We're building innovative solutions in-house and transforming the financial industry.


---

This blog was authored by Jerzy Grzywinski and Brent Segner.

Jerzy Grzywinski is a Sr. Director of Software Engineering at Capital One. Jerzy leads Capital One’s enterprise FinOps organization, which brings together engineering principles, tooling and enterprise strategy to maximize the value of the cloud. Jerzy combines 15 years of engineering experience with a passion for finance to develop tools and architecture patterns and to champion a ‘Good engineering is efficient engineering’ standard across 2,000+ Capital One tech teams.

Brent Segner is a Distinguished Engineer for FinOps at Capital One. As a Distinguished Engineer, Brent leverages his background in cloud architecture, data science and finance to drive the cloud cost optimization strategy within Capital One. With years of experience across both public and private cloud infrastructure, he has developed a deep technical understanding of how to identify inefficiencies and how to implement solutions to address them.