Canary deployments: Minimize risk, maximize resiliency

Today’s customers have high expectations of websites and web apps. They expect them to always be available, and it becomes big news when they are not. Imagine a social media app going down for an hour or so. The company will put its best people on fixing the issue as soon as possible. It will also send communications to users explaining what happened and how it will avoid a repeat in the future. Internally, it will go through many retrospectives and postmortems to strengthen its systems and infrastructure and follow through on its promises to users.

And this is just for social media apps. Financial sites and apps are very close to customers’ personal lives, affecting their ability to access and use their money. Customers increasingly perform financial transactions online, from making a payment to trading a stock or transferring money to an account. As a result, they increasingly expect financial apps to always be available. Any disruption to the availability of these apps impacts not only customer satisfaction but also the trust, reputation, and credibility of the financial institution.

How do we ensure that financial apps are always available for customers?

Changes to Live Systems Are Inevitable

Enterprises in the financial world and beyond are increasingly building “always-on” systems to provide around-the-clock availability to their customers. These systems need constant updates to add new features, update technology stacks, and so on. Changing a live system always comes with the risk of something going wrong. Failures can stem from system errors, from application or functional defects, or from both. There are different strategies to mitigate the risk of these failures for “always-on” systems.

Blue-Green Deployment

Blue-Green Deployment is a popular technique for deploying a new version of software without impacting current live traffic. In this approach, two infrastructure stacks are created; one is called Blue and the other Green. At any given time, one stack serves live traffic while the other is idle. The idle stack's infrastructure can be turned off between deployments to avoid wasting resources. During an update, the new version is deployed on the idle stack. Once all validations are complete, traffic is switched to that stack: it becomes the active stack, and the previously active stack becomes idle.

See the diagram below:

[Figure: Blue-Green Deployment]
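
The swap described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real deployment tool: the Stack and BlueGreen names are hypothetical, and the validated flag stands in for whatever validations a team actually runs.

```python
from dataclasses import dataclass


@dataclass
class Stack:
    name: str
    version: str


class BlueGreen:
    """Tracks which of two stacks is live. All names here are hypothetical."""

    def __init__(self, active: Stack, idle: Stack) -> None:
        self.active = active
        self.idle = idle

    def deploy_to_idle(self, version: str) -> None:
        # The new version only ever lands on the stack that is not serving traffic.
        self.idle.version = version

    def cut_over(self, validated: bool) -> bool:
        # Swap roles only after validations pass; otherwise live traffic is untouched.
        if not validated:
            return False
        self.active, self.idle = self.idle, self.active
        return True


env = BlueGreen(active=Stack("blue", "v1"), idle=Stack("green", "v1"))
env.deploy_to_idle("v2")
switched = env.cut_over(validated=True)
print(env.active.name, env.active.version)
```

After the cut-over, the old stack still holds the previous version, which is what makes switching back straightforward.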

Canary Deployment

Canary Deployment is a technique where traffic is gradually shifted to a new version of the software. The benefit of this approach is that any issue affects fewer users than it would if full traffic were opened to the new version at once. Once the new version is validated and no issues are discovered, it is deployed to the remaining servers on a rolling basis.

See the diagram below:

[Figure: Canary Deployment]

Canary Deployment is one of the best approaches for upgrading ‘always-on’ systems where both versions, old and new, can run in parallel without side effects. It requires end-to-end validations to make sure the new version has no issues. Traffic is first opened to a small population of users and then slowly increased. If any issues appear, this allows for a quick rollback to the previous version.
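
As a rough sketch of that flow, the Python below walks a hypothetical schedule of traffic percentages and sends everything back to the old version as soon as a check fails. The run_canary function, the step values, and the is_healthy callback are assumptions made for illustration; a real pipeline would plug in its own end-to-end validations.

```python
from typing import Callable, Iterable, List, Tuple


def run_canary(
    steps: Iterable[int],
    is_healthy: Callable[[int], bool],
) -> Tuple[int, List[int]]:
    """Walk through traffic percentages; return (final_percent, history)."""
    history: List[int] = []
    for percent in steps:
        history.append(percent)
        if not is_healthy(percent):
            # Any failed validation sends all traffic back to the old version.
            history.append(0)
            return 0, history
    return (history[-1] if history else 0), history


# Hypothetical schedule: 5% of users first, then progressively more.
good_final, good_history = run_canary([5, 25, 50, 100], lambda pct: True)
bad_final, bad_history = run_canary([5, 25, 50, 100], lambda pct: pct < 50)
print(good_final, good_history)
print(bad_final, bad_history)
```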

AWS Services

AWS provides a rich set of services which can be used to implement Blue-Green and Canary Deployments.

  • Route 53: Managed DNS service which provides routing and health check capabilities. The routing policy determines how Route 53 is going to respond to queries.
  • ALB: Application Load Balancer distributes incoming traffic across multiple targets, such as EC2 instances, in multiple Availability Zones.
  • ECS: Elastic Container Service is a container orchestration service for running and scaling containerized (Docker) applications.
  • EC2: Elastic Compute Cloud provides scalable computing capacity in AWS.
  • ASG: An Auto Scaling group contains a collection of EC2 instances that share similar characteristics and are treated as a logical grouping for the purposes of instance scaling and management.

Implementing Canary Deployments Using AWS

There are many ways to implement Canary Deployments using AWS services. These techniques can be used for ALB/ECS stacks or ALB/ASG stacks. Canary Deployments can also be achieved by creating one stack or by creating two stacks.

1 Stack Approach: Implementing Canary Deployment Using AWS Services

In this approach, a single ALB/ECS stack serves live traffic. The new version of the software is deployed either by increasing the size of the ASG or by attaching another ASG. Once the new version is validated and no issues are discovered, it is deployed on the remaining instances on a rolling basis.

See the diagram below:

[Diagrams: 1 Stack Approach using a single ASG]

Consider this approach when the stack is small and the SLA allows more time for rollback. Note that it does not let you specify the percentage of traffic sent to the new version, because the ALB routes to all the EC2 instances. In the above example, once a new instance is created there are 4 instances in total, so about 25% of traffic goes to the new version.
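
The 25% figure is simple arithmetic, assuming the ALB spreads requests roughly evenly across healthy instances. The canary_share helper below is hypothetical and only illustrates why the traffic split is a side effect of instance counts rather than something you configure.

```python
def canary_share(new_instances: int, old_instances: int) -> float:
    """Approximate fraction of requests reaching the new version.

    Assumes requests are distributed evenly across all instances.
    """
    total = new_instances + old_instances
    if total == 0:
        raise ValueError("at least one instance is required")
    return new_instances / total


# One new instance next to three old ones, as in the example above.
print(canary_share(1, 3))  # 0.25
# With a larger stack, one new instance receives a smaller share.
print(canary_share(1, 9))  # 0.1
```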

Also, a rollback after full release could take longer than desired. To roll back, repeat the above process with the previous version of the software. The time this takes depends on the number of instances and the boot time of each.

Another variation of this approach is to use two ASGs and add new instances to the second ASG. See the diagrams below:

[Diagrams: 1 Stack Approach using two ASGs]

Consider this approach when the stack is larger and the SLA allows more time for rollback. As with a single ASG, it does not let you specify the percentage of traffic sent to the new version, because the ALB routes to all the EC2 instances.

Rollback is easier while both ASGs are serving traffic than after the older version has been shrunk. Just shrink the ASG that has V2 to 0, which terminates all the V2 instances. After full release, however, rollback takes longer. To roll back, repeat the above process with the previous version of the software. The time this takes depends on the number of instances and the boot time of each.
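
To get a feel for how instance count and boot time drive rollback duration after a full release, here is a deliberately simplified estimate. It assumes instances are replaced in sequential batches and that each batch waits one boot time; the function name, the batch model, and the numbers are illustrative assumptions, not measurements.

```python
import math


def rolling_rollback_seconds(
    instances: int, boot_seconds: float, batch_size: int = 1
) -> float:
    """Rough estimate: batches are replaced one after another, each waiting for boot."""
    if instances < 0 or batch_size < 1:
        raise ValueError("instances must be >= 0 and batch_size must be >= 1")
    batches = math.ceil(instances / batch_size)
    return batches * boot_seconds


# Hypothetical stack: 12 instances that each take 90 seconds to boot.
one_by_one = rolling_rollback_seconds(instances=12, boot_seconds=90)
in_batches = rolling_rollback_seconds(instances=12, boot_seconds=90, batch_size=4)
print(one_by_one, in_batches)
```

Under these assumptions, larger batches shorten the rollback but put more capacity in flux at once, which is part of the trade-off when choosing this approach.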

2 Stack Approach: Implementing Canary Deployment Using AWS Services

In this approach, stand up two parallel ALB/ECS stacks and use a Route 53 weighted routing policy to send traffic to either or both. By default, 100% of traffic is served by one stack (say, Blue). The new version is deployed on the inactive stack (in this case, Green). All validations and health checks are run on the Green stack before any traffic is routed to it. Once everything looks good, a small percentage of traffic (say, 10%) is opened to the Green stack by changing the weights of the Route 53 records.

See the diagram below:

[Diagram: 2 Stack Approach, Route 53 weighted routing between the Blue and Green stacks]
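
One way to express the weight change is as a Route 53 change batch with two weighted records. The sketch below only builds the request payload; the helper name, record names, and targets are hypothetical, and it uses plain CNAME records for simplicity, whereas a setup pointing at ALBs might well use alias records instead. Route 53 sends each record a share of traffic proportional to its weight relative to the sum of the weights.

```python
from typing import Any, Dict


def weighted_change_batch(
    record_name: str,
    blue_target: str,
    green_target: str,
    green_percent: int,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build a ChangeBatch splitting traffic between two weighted CNAME records."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")

    def record(set_id: str, target: str, weight: int) -> Dict[str, Any]:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,  # distinguishes records sharing a name
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Comment": f"canary: {green_percent}% to green",
        "Changes": [
            record("blue", blue_target, 100 - green_percent),
            record("green", green_target, green_percent),
        ],
    }


# Hypothetical names; open 10% of traffic to the Green stack.
batch = weighted_change_batch(
    "api.example.com.", "blue-alb.example.com", "green-alb.example.com", 10
)
# With boto3, this payload could be passed along the lines of:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your zone id>", ChangeBatch=batch)
print([c["ResourceRecordSet"]["Weight"] for c in batch["Changes"]])
```

Rolling back is the same call with green_percent set to 0.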

When both stacks are serving traffic, it is easy to compare the health, response times, and error rates of each stack in an automated way and decide on increasing or decreasing traffic to the new stack. If all is going well, then slowly increase traffic to the new stack until it reaches 100%.

See the diagram below:

[Diagram: 2 Stack Approach, traffic shifted to the new stack]
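
The automated comparison could look something like the following sketch. The StackMetrics fields, the thresholds, and the step size are all assumptions chosen for illustration, and the policy is deliberately conservative: any regression on the new (Green) stack sends its traffic back to zero rather than merely reducing it.

```python
from dataclasses import dataclass


@dataclass
class StackMetrics:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile response time


def next_green_percent(
    current: int,
    blue: StackMetrics,
    green: StackMetrics,
    step: int = 10,
    max_error_delta: float = 0.01,
    max_latency_ratio: float = 1.2,
) -> int:
    """Return the next traffic percentage for the new (Green) stack."""
    worse_errors = green.error_rate > blue.error_rate + max_error_delta
    worse_latency = green.p95_latency_ms > blue.p95_latency_ms * max_latency_ratio
    if worse_errors or worse_latency:
        return 0  # send all traffic back to the old stack
    return min(100, current + step)


blue = StackMetrics(error_rate=0.002, p95_latency_ms=180)
healthy_green = StackMetrics(error_rate=0.003, p95_latency_ms=190)
failing_green = StackMetrics(error_rate=0.05, p95_latency_ms=190)

print(next_green_percent(10, blue, healthy_green))
print(next_green_percent(10, blue, failing_green))
```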

The benefits of this approach are as follows:

  • Full control over the percentage of traffic sent to the new version.
  • Should not impact the health of the existing stack.
  • Easier and quicker to roll back; the time it takes to switch traffic depends on the TTL value configured for the Route 53 record set.

Consider this approach for business-critical systems with low risk tolerance, as it provides the quickest rollback option.

Summary

In conclusion, before making a decision on deployment strategies, assess the risk tolerance of your systems, identify the risks to your live system, and choose the best approach that works for your system. Here are some general guidelines which might be beneficial to making systems more resilient.

  • Have a robust health-check mechanism.
  • Add intelligent monitors and alerts.
  • Evaluate risk tolerance of each release and have the fallback options ready.
  • Decide on a deployment strategy that works best for your system.
  • Use automated testing, automated validations, and automated pipelines.
  • Add intelligence for automated rollbacks.
  • Separate infrastructure releases (JDK upgrades, framework upgrades) from functional releases (new features, enhancements, logic changes).
  • Have a comprehensive release communication plan.

Iftikhar Khan, Director, Software Engineering, eAPI Enterprise

Director of Software Engineering, managing cloud-native, always-on and business-critical enterprise platforms and APIs.
