Using machine learning to build more resilient mobile apps

How machine learning models help provide auto failover capabilities to enhance the customer experience

Capital One Tech

November 18, 2020

By Mihir Shah, Director, Software Engineering

Usually, when cloud technology and machine learning are mentioned in the same sentence, the topic is how the cloud enables machine learning via scaling, availability, and convenient pay-per-use models. And while that’s true — the cloud enables the collection and use of data that underpins most machine learning algorithms — the relationship between cloud and machine learning can also go in the other direction, with machine learning helping improve the resiliency and optimization of cloud-based apps.

At Capital One, we’ve deployed machine learning models to improve the resiliency of our Capital One Mobile App, which is a vital part of our millions of customers’ financial lives, particularly during the pandemic as many of our lives have moved increasingly online.

Digital banks operating outside the confines of business hours and brick-and-mortar branches need to maintain the trust, reputation, and credibility of their consumer financial products. That’s why Capital One strives for anywhere availability on the Capital One Mobile App. That’s also why we’re pioneering efforts to deploy machine learning for added resiliency in the cloud.

The Cloud is the Heart of Capital One’s Infrastructure

Capital One this year became the first U.S. bank to report that we had exited legacy data centers and gone all in on the public cloud with AWS. About eight years ago, we realized the technological landscape was undergoing a shift. We knew that as emerging technologies matured, customers would have rising expectations about the kinds of real-time, on-demand, personalized experiences their banks should provide. If we only made incremental changes to our tech stack, we’d have difficulty keeping up with demand, let alone setting the curve in our industry.

We also knew this transformation wasn’t going to be easy. Capital One is America’s largest direct bank, with 70 million customer accounts. We are the second-largest financial institution auto-loan originator and the third-largest credit card issuer in the U.S.

Additionally, and this is no small factor in anything we do, we operate within a heavily regulated industry with additional requirements around the safety and security of the products we build.

Between the size of the company, our regulated industry, and the general maturation of the field of cloud infrastructure, it took eight years to fully transform from a traditional bank that used technology into a technology company that does banking.

As a technology company, we needed to become great at building cloud-native software that could take full advantage of this post-data center world. To get there, we scaled our engineering organization until we hit nearly 11,000 technology associates, with 85 percent of that workforce being engineers. We moved from a waterfall model to a 100 percent agile model for delivering software. And we re-architected our data environment to build the foundation for machine learning across the company, from call center operations to back-office processes, fraud, security, and digital experiences.

But migrating the flagship Capital One Mobile App to the cloud was no small task. It was decidedly not a “lift and shift” project. Transferring operations to the cloud demanded a full re-architecture of our orchestration systems, including many of the support tools around them.

As we looked to leverage our modern tech stack to continue meeting rising customer expectations, one crucial part of our progress was using machine learning to improve the Capital One Mobile App’s resiliency and optimization.

Building Resiliency into the Capital One Mobile App

First things first: what is resilience? Resilience is an outcome, a property of an app or a piece of software that is well architected, well tested, and has the right level of automation to provide seamless user experiences. Ideally, we want resilient applications to adapt to changes seamlessly, to identify and recover from failures without human intervention, and to do so consistently and not at the cost of user experiences. Building in the cloud is conducive to building resilient apps due to the regional autonomy, redundancy, and data insights that cloud providers offer. But none of this is automatic, and it has to be purposefully architected for.

There are many ways to architect for and deliver that resiliency. Because we are an open-source-first organization, we opted for open source tools from forward-leaning organizations, such as Netflix Eureka, HashiCorp Consul, HashiCorp Nomad, and Fabio, to help us achieve our resiliency goals for the Capital One Mobile App.

Between these open source tools, our own proprietary internal tools, and our orchestration team which continuously monitors our services, we have a solid base for our app’s resiliency. But, as we said earlier, automation is a crucial part of the puzzle and oversight isn’t easy for humans. This is why we also deploy machine learning algorithms in our app - one of them being an auto-failover machine learning model that helps prevent lapses in our cloud service.

Balancing the Scales with Machine Learning

This machine learning model understands the ins and outs of the behavior patterns behind our cloud architecture and the regions we operate in, and uses that understanding to optimize our cloud use for maximum resiliency. Here is a simple explanation of how the auto-failover machine learning model works.

It should come as no surprise that our app operates in multiple cloud regions. This is a fairly common tactic to safeguard against outages and aid in disaster recovery. It also allows us to do global load balancing: because we run in multiple cloud regions, our users hit the instance of the application that is geographically closest to them. All of this ladders up to an app that is much more resilient, while using fewer resources and costing less money, than one run in the data center.

But being multi-region isn’t where things end when it comes to cloud resiliency. Our machine learning model understands the behavior patterns of our architecture and the regions it operates in. When there’s a deviation from the typical use pattern, the machine learning model limits the blast radius and rectifies the deviation before customers are inconvenienced. In this case, it does so by predicting when cloud regions will reach capacity and shifting traffic automatically before those regions start to fail.

Without this machine learning algorithm, we wouldn’t be taking full advantage of the cloud or of operating in multiple cloud regions. Operations teams would still have to wait for that 2 a.m. hiccup in the system and then rush to fix it manually. Instead, we opt for graceful automatic failover: our machine learning algorithm notifies us of a potential failure and automatically switches our traffic, with no manual oversight and no disruption to customer service. All those transfers and payments go through, and the customer is none the wiser to the work our system is putting in on the backend.
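To make the idea of predictive failover concrete, here is a minimal sketch in Python. It is not the model we run in production: the forecast is a plain least-squares trend line standing in for a real machine learning model, and the names RegionStats, forecast_load, plan_traffic_weights, and the 80 percent threshold are all hypothetical, chosen only for illustration. The sketch forecasts each region's load a few steps ahead and assigns routing weights in proportion to the headroom each region is expected to have, so a region trending toward capacity is drained before it fails.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RegionStats:
    """Hypothetical per-region telemetry used by this sketch."""
    capacity: float            # requests/sec the region can serve
    recent_load: List[float]   # recent requests/sec samples, oldest first


def forecast_load(samples: List[float], steps_ahead: int = 3) -> float:
    """Extrapolate a least-squares trend line; a stand-in for a real model."""
    n = len(samples)
    if n < 2:
        return samples[-1] if samples else 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return max(0.0, intercept + slope * (n - 1 + steps_ahead))


def plan_traffic_weights(
    regions: Dict[str, RegionStats], threshold: float = 0.8
) -> Dict[str, float]:
    """Return routing weights that sum to 1, proportional to forecast headroom.

    A region whose forecast load exceeds threshold * capacity gets weight 0.
    """
    headroom = {}
    for name, stats in regions.items():
        predicted = forecast_load(stats.recent_load)
        headroom[name] = max(0.0, threshold * stats.capacity - predicted)
    total = sum(headroom.values())
    if total == 0:
        # Every region is forecast to be hot: fall back to an even split.
        return {name: 1 / len(regions) for name in regions}
    return {name: h / total for name, h in headroom.items()}


regions = {
    "east": RegionStats(capacity=1000, recent_load=[600, 700, 800]),
    "west": RegionStats(capacity=1000, recent_load=[300, 300, 300]),
}
weights = plan_traffic_weights(regions)
```

In this example the east region is still serving traffic at 800 requests per second, but its trend puts it over the threshold within three steps, so the plan moves its traffic to the west region ahead of time rather than after a failure.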

The effect is analogous to an analog weight scale, the kind with a balance arm. If you picture two identical cups containing the equivalent amount of water on either side of the scale, you have stasis, a balance. But internet traffic is never at stasis; it’s constantly moving, ebbing, and flowing. So now imagine replacing that water with internet data — and there’s a need to continuously optimize and rebalance those cups in response to that moving, ebbing, and flowing. If you do it right, the scale remains balanced, despite the intense activity within those cups. That is the end goal of our auto-failover machine learning model.

Machine learning is necessary for balance, because the algorithms are faster and more predictive of a failover than a human ever could be. In some cases, however, humans are required to verify certain choices the machine learning model would like to make.

As a member of our Edge engineering team wrote, we implement various resiliency services, such as circuit breakers and fallback and bulkhead patterns. When combined with the client-side load balancing pattern, these services help keep our mobile app stable and reliable. Client-side load balancing easily scales and handles updates efficiently. The capability is pushed to each client, distributing the responsibility for load balancing — the scale is balanced.
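For readers unfamiliar with the circuit breaker and fallback patterns mentioned above, the following Python sketch shows the general idea. It is an illustration under stated assumptions, not the implementation used in our app: the CircuitBreaker class, its max_failures and reset_after parameters, and the injectable clock are all hypothetical. After a number of consecutive failures the breaker "opens" and serves the fallback without calling the failing service; once a cool-down has elapsed it lets a trial call through.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical minimal circuit breaker with a fallback."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after      # cool-down in seconds
        self._clock = clock                 # injectable so tests need not sleep
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the cool-down, report closed so one trial call can go through.
        return self._clock() - self._opened_at < self.reset_after

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open:
            return fallback()               # skip the failing service entirely
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()   # open (or re-open) the breaker
            return fallback()
        # Success: close the breaker and forget earlier failures.
        self._failures = 0
        self._opened_at = None
        return result
```

The fallback is what keeps the failure graceful: the caller still gets an answer, for example a cached value, while the struggling service is given time to recover.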

If a service instance responds too slowly or throws errors, the load balancer detects the problem, takes corrective action, and removes the unhealthy instance. Should a discovery service go down, a local copy is maintained on the client so that connectivity can continue with nearly current customer information. That is a fancy way of saying that everything about the app is architected to work hand in hand toward the goal of resilience.
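The behavior described above can be sketched as a small client-side load balancer in Python. This is a simplified illustration, not the API of Eureka, Consul, or any of our internal tools: the ClientSideBalancer class, the fetch_instances callback, and the max_errors setting are hypothetical. The client keeps a local copy of the instance list, rotates through healthy instances, stops using an instance after repeated failures, and keeps working from its local copy if the discovery service cannot be reached.

```python
from typing import Callable, Dict, List


class ClientSideBalancer:
    """Hypothetical round-robin balancer that lives inside each client."""

    def __init__(
        self, fetch_instances: Callable[[], List[str]], max_errors: int = 2
    ) -> None:
        self._fetch = fetch_instances   # stands in for a discovery lookup
        self._max_errors = max_errors
        self._cache: List[str] = []     # local copy of the registry
        self._errors: Dict[str, int] = {}
        self._next = 0

    def refresh(self) -> None:
        try:
            fresh = list(self._fetch())
        except Exception:
            return  # discovery is down: keep serving from the local copy
        self._cache = fresh
        # Keep error counts only for instances that are still registered.
        self._errors = {i: c for i, c in self._errors.items() if i in fresh}

    def healthy(self) -> List[str]:
        return [
            i for i in self._cache if self._errors.get(i, 0) < self._max_errors
        ]

    def choose(self) -> str:
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy instances known")
        instance = candidates[self._next % len(candidates)]
        self._next += 1
        return instance

    def report(self, instance: str, ok: bool) -> None:
        """Callers pass ok=False for an error or an overly slow response."""
        if ok:
            self._errors.pop(instance, None)
        else:
            self._errors[instance] = self._errors.get(instance, 0) + 1
```

A client would call refresh() periodically, choose() before each request, and report() afterward. Because every client carries its own copy of this logic, there is no central balancer to become a bottleneck or a single point of failure.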

Tech Because It’s Human

Most of us can understand why mobile service is king. Our lives don’t operate within business hours or within proximity to bank branches. When any of us, as consumers, are at the point of sale considering a significant purchase, we don’t want to wait until we get home to learn about our up-to-the-minute bank balances.

That’s why we opt for failing gracefully, to limit disruption to your financial life. If we do our job exceptionally well, you’ll never know there was a problem within the system.

No developer wants their app to fail, but for us the stakes are high because we expect the best from the products we build. That’s why we use technologies such as machine-learning-enabled auto-failover. It’s not technology for its own sake. It’s technology because humans are at the other end, and they are relying on us for the services that matter to them most.
