Reduce the Blast Radius of a Bad Deployment with Automated Canary Analysis

Software deployment processes differ across organizations, teams, and applications. The most basic, and perhaps the riskiest, is the “big bang deployment.” This strategy updates all nodes within the target environment simultaneously with the new software version.

This deployment strategy causes many issues, including downtime while the update is in progress. It can also be challenging to perform a rollback when problems arise in production.

A big bang deployment can be devastating. For example, it can raise issues in production or even take down the system. Since rollbacks aren’t easy, production issues can cause a prolonged, catastrophic outage.

Overall, a big bang deployment throttles the delivery pipeline, causing deployments, bug fixes, and rollbacks to take too long.

Although bad deployments cause challenges, automated canary analysis helps. Let’s explore how to reduce the blast radius of a bad deployment using Armory’s version of Spinnaker and Kayenta for automated canary analysis.

Safe and Reliable Deployments with Armory

Releasing lousy software can have a long-lasting, damaging impact on any business. Armory helps you overcome this by empowering developers with tools for automated, intelligent software delivery. Intelligent software delivery prevents large-scale outages and minimizes the blast radius of bad deployments.

DevOps teams can define Armory pipelines using templates to encourage reuse and continuous delivery (CD). These pipelines enforce best practices from Netflix, Google, and the open-source software (OSS) community. They also reduce the risk of bad deployments and large-scale outages.

Armory minimizes the impact of bad deployments using a 1-click rollback when a new deployment causes errors. Our platform also uses the automated canary analysis technique to test new releases before deploying them to production.

With this technique, the new changes deploy to a small set of users before the full release. The software automatically promotes or fails the deployment based on predefined metrics.

What is Canary Deployment?

The idea of canary deployment is based on canaries in coal mines. Miners took these birds into the mine to measure the amount of toxic gas present. Canaries are more sensitive to dangerous gases than humans, so miners would know hazardous gases were likely present if a canary died inside the mine.

Software deployment uses a version of this strategy. Instead of birds, the canary is the new software version. DevOps teams roll out a new application version in stages, deploying it to a small subset of the production infrastructure and enabling the latest version for a small set of users who help identify issues.

When ready, DevOps teams deploy the software version to a larger subset of the infrastructure and a larger set of users, and so on, until the rollout is complete. This strategy dramatically reduces the risk associated with deploying a new software version into production.

Manual Versus Automated Canary Analysis

When a DevOps team manually performs this deployment strategy, they deploy a new version of the application to a small subset of production servers and shift a small percentage of traffic to the latest version. The rest of the servers and traffic remain unchanged.

The team then looks at graphs and logs and monitors multiple metrics to determine the server health with the new version. If the results are within acceptable values, then the team deploys to a larger set of servers and traffic and repeats the process. Otherwise, they roll back the deployment and route all the traffic to the stable servers.

As we can see, manual canary analysis is tedious and time-consuming. Moreover, it doesn’t scale well when deploying to multiple servers several times a week or day and is prone to human error.

Automated canary analysis overcomes these problems by handing the repetitive work to software. Automation makes fetching metrics and running statistical tests more accurate and less time-consuming.

Automated Canary Analysis with Spinnaker and Kayenta

The automated canary analysis (ACA) platform Kayenta integrates with Armory, which is based on the open-source multi-cloud continuous delivery platform Spinnaker. You can set up an automated canary analysis stage in a Spinnaker pipeline and use Kayenta to assess the canary’s risk. Kayenta fetches user-configured metrics, runs statistical tests, checks for degradation between the new and old versions, and provides an aggregate score for the canary.

Based on the score, Kayenta automatically judges whether to promote or fail the canary or prompt human intervention. It does this in two phases: metric collection and judgment.

Armory implements canary analysis by running three clusters in parallel: production, canary, and baseline. Comparing the canary directly with the production cluster wouldn’t produce reliable results, because production instances have been running for a long time and their metrics reflect long-running process effects (warmed-up caches, for example). For this reason, Armory creates a fresh baseline cluster where it deploys the application’s production version.

The platform then compares the canary cluster against the baseline cluster. The canary and baseline clusters each receive a small percentage of the traffic, while the rest goes to the production cluster. Armory then handles the lifecycle of the canary and baseline clusters.
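As a rough illustration of that traffic split, the sketch below computes the share each cluster receives. The function name `split_traffic` and the choice to give the canary and baseline equal shares are assumptions made for this example, not Armory settings.

```python
def split_traffic(canary_percent: float) -> dict:
    """Return the percentage of traffic each cluster receives.

    Assumption for this sketch: the baseline gets the same share as the
    canary, and production takes whatever is left.
    """
    if not 0 < canary_percent < 50:
        raise ValueError("canary_percent must be between 0 and 50 (exclusive)")
    return {
        "production": 100 - 2 * canary_percent,
        "baseline": canary_percent,
        "canary": canary_percent,
    }


# Example: 5% to the canary, 5% to the baseline, the rest to production.
weights = split_traffic(5)
print(weights)
```

Giving both comparison clusters the same share keeps their load similar, so differences in their metrics are more likely to come from the new version than from traffic volume.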

Metric Collection (Retrieval)

In Armory’s canary pipeline stage, DevOps teams can specify the metrics to check and their sources. Armory supports Stackdriver, Prometheus, Datadog, SignalFx, and New Relic. When Armory uses different sources, it combines the various metrics into a single analysis. However, it’s considered best practice to avoid adding too many metrics in one group.

Armory retrieves these metrics from the baseline and canary clusters, tags them (baseline or canary), and stores them in a time-series database. It then passes the results to the canary judge for analysis.

Judgment

The judgment stage compares the baseline and canary results from the collection stage, evaluates each metric individually, and performs statistical tests. The output is an aggregate score ranging from 0 to 100, which falls into one of three categories:

  • Success: The judge promotes the canary to production.
  • Marginal: The judge needs human intervention to make a decision.
  • Failure: The judge recommends stopping the pipeline, performing a rollback, and directing traffic to production.
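The mapping from score to category can be sketched with two cut-off values. The function name `classify_score` and the default thresholds of 75 and 95 are hypothetical values chosen for illustration; in practice the team sets the thresholds for each pipeline.

```python
def classify_score(score: float,
                   marginal_threshold: float = 75,
                   pass_threshold: float = 95) -> str:
    """Map an aggregate canary score (0 to 100) to one of three categories.

    The threshold defaults are illustrative assumptions, not Armory defaults.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= pass_threshold:
        return "success"   # promote the canary
    if score >= marginal_threshold:
        return "marginal"  # ask a human to decide
    return "failure"       # stop the pipeline and roll back


for score in (100, 80, 50):
    print(score, classify_score(score))
```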

The judgment stage has four main steps:

  • Data validation: This ensures there is valid data for the baseline and canary metrics before analysis. If the required data is not available for either the baseline or the canary or both, Armory labels the metric NODATA and moves the analysis to the next metric.
  • Data cleaning: This prepares the raw data for comparison. This preparation includes handling and sanitizing missing values and removing outliers.
  • Metric comparison: This compares the canary and the baseline for each metric. It classifies each metric as pass, high, or low, indicating the difference between the canary and the baseline.
  • Score computation: This computes the final score from the metric classifications. The score is the number of pass metrics divided by the total number of metrics, expressed as a percentage.
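The first three steps can be sketched for a single metric as follows. This is a simplified illustration under stated assumptions, not the judge's actual implementation: all function names are hypothetical, missing values are represented as NaN, outliers are removed with a 1.5 × IQR rule, and a plain median-with-tolerance check stands in for the statistical test.

```python
import math
import statistics


def validate(baseline, canary):
    """Data validation: each series needs at least one non-missing value."""
    return (any(not math.isnan(v) for v in baseline)
            and any(not math.isnan(v) for v in canary))


def clean(values):
    """Data cleaning: drop NaN values, then drop points outside 1.5 * IQR."""
    present = [v for v in values if not math.isnan(v)]
    if len(present) < 4:
        return present  # too few points to estimate quartiles
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = 1.5 * (q3 - q1)
    return [v for v in present if q1 - spread <= v <= q3 + spread]


def compare(baseline, canary, tolerance=0.10):
    """Metric comparison: a median check standing in for a statistical test.

    The 10% tolerance is an arbitrary value chosen for this sketch.
    """
    base = statistics.median(baseline)
    cand = statistics.median(canary)
    limit = tolerance * abs(base)
    if cand > base + limit:
        return "high"
    if cand < base - limit:
        return "low"
    return "pass"


def judge_metric(baseline, canary):
    """Run validation, cleaning, and comparison for one metric."""
    if not validate(baseline, canary):
        return "nodata"
    return compare(clean(baseline), clean(canary))


nan = float("nan")
print(judge_metric([101, 99, 100, 102, 98, 100, nan],
                   [100, 103, 99, 101, 100, 500]))
print(judge_metric([10, 11, 9, 10, 10], [15, 16, 14, 15, 15]))
print(judge_metric([nan, nan], [1.0, 2.0]))
```

In the first call, the missing baseline value and the 500 spike in the canary are removed during cleaning before the two series are compared.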

The calculation for the final score is:

(pass metrics / total number of metrics) * 100
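As a worked example of this formula, the sketch below scores a list of per-metric classifications. The function name `canary_score` is hypothetical, and the sketch counts every listed metric in the total.

```python
def canary_score(classifications):
    """Return the percentage of metrics classified as "pass"."""
    if not classifications:
        raise ValueError("at least one metric classification is required")
    passed = sum(1 for c in classifications if c == "pass")
    return 100 * passed / len(classifications)


# Three of four metrics pass, so the score is 75.
print(canary_score(["pass", "pass", "high", "pass"]))
```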

Armory then uses this score to decide whether to promote or roll back the deployment.

Next Steps

Big bang deployments can cause production issues, especially for large projects. Fortunately, Kayenta and Armory’s version of Spinnaker overcome these issues using automated canary analysis. This integration enables developers to quickly deploy changes to production with minimal risk.

Integrate Kayenta into Armory to benefit from safe, intelligent canary deployment and increased productivity. If your new software version does cause an issue in production, the blast radius will be much smaller, and metrics and automation help you roll back more quickly. A smaller blast radius means less mess and happier customers.

Contact Armory today for a complimentary assessment of your software delivery practices and learn more about how your organization can benefit from safe, reliable deployments.
