Why You Need Continuous Deployment


Manual labor vs automation, by ChatGPT

A few weeks ago a colleague described a software development team that still relied on manual testing and deployment practices. Deployment was so complex that the original developers wore completing it as a badge of honor. The question was: how can that team move toward a Continuous Deployment model? I confess to a shudder of horror when I read that, having lived that scenario in previous organizations. While moving from a complex manual deployment to an automated one is a large technical challenge, the cultural shift required for a successful change is far more important.

First we should cover some terms:

  • Continuous Integration, the CI part of what’s commonly referred to as “CI/CD”: the practice where developers merge code to the main branch frequently, and automated tools build and test the code after every merge.
  • Continuous Delivery: every change is kept in a deployable state and available to be delivered to customers, but the act of deployment may be a manual step, often requiring approval.
  • Continuous Deployment, traditionally the CD part of CI/CD: the practice where every change that passes automated tests is automatically deployed and made available to your customers. The sketch after this list illustrates how it differs from Continuous Delivery.

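To make that distinction concrete, here is a minimal sketch in Python. It is not any particular CI system; build(), run_tests() and deploy() are hypothetical stand-ins for real pipeline steps, and the only difference between Continuous Delivery and Continuous Deployment is whether the final deploy step waits for a human.

    # Minimal sketch of a pipeline, showing where Continuous Delivery and
    # Continuous Deployment differ: the final gate before production.
    # build(), run_tests() and deploy() are hypothetical stand-ins.

    def build() -> bool:
        print("building artifact...")
        return True

    def run_tests() -> bool:
        print("running automated tests...")
        return True

    def deploy() -> None:
        print("deploying to production...")

    def pipeline(continuous_deployment: bool, approved_by_human: bool = False) -> None:
        if not (build() and run_tests()):
            print("pipeline failed; nothing is deployed")
            return
        # Continuous Delivery: the artifact is releasable, a human decides when to ship.
        # Continuous Deployment: every passing change ships automatically.
        if continuous_deployment or approved_by_human:
            deploy()
        else:
            print("artifact is releasable and waiting for manual approval")

    if __name__ == "__main__":
        pipeline(continuous_deployment=True)   # Continuous Deployment: ships automatically
        pipeline(continuous_deployment=False)  # Continuous Delivery: waits for approval
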
There is a lot of writing on how to implement Continuous Deployment using various technologies, but far less on the why, which I’ll explore in the remainder of this article.

Two publications stand out as recommended reading, and I’ll build on them. The first is The Phoenix Project, a fantastic fictional story that introduces the DevOps concept, which includes Continuous Deployment. The second is the Continuous Delivery website, a great resource linking to peer-reviewed studies and further reading.

Examples

I’ll share three examples of services I’ve worked on, each with a very different state of Continuous Deployment.

My career in three pictures, by ChatGPT

I’m not sharing these to shame or blame anyone; on the contrary, each team was full of smart individuals trying to do the right thing with the tools and limitations we had at the time. We have collectively learned a lot about how to deliver software in the past two decades, and we can use these examples to explore why you want Continuous Deployment for your services.

Early in my career I supported a development team which had scheduled deployments on a prescribed day of the week, inevitable rollbacks later that day, and a hotfix deploy later in the week. That sounds like a critique but it’s not. This team was trying to deliver business value on a weekly cadence, almost unheard of in the era of quarterly or even yearly releases. This took place in the early 2000s, when automated testing frameworks were almost non-existent and the Subversion source control tool had just been released. The team had no automated builds or tests, and executables were copied by hand (usually by me) to the target servers. As I write that I can’t help but be impressed that the system worked at all. It’s easy to pick out what’s wrong, but what did they get right?

  1. Change sets were small, less than a week’s worth of changes by a few developers.
  2. Feedback from the customers was instantaneous.
  3. Rollbacks were possible and tested prior to deployment.

A service I worked on at BlackBerry, which is no longer in use, had quarterly releases with the traditional dev-complete code lockdown, a long testing phase, and a very complex and manual deployment procedure in the terrible 0200–0400 weekend change window. In spite of the thousands of hours of testing for each release, almost every deployment resulted in a rollback or multiple hotfixes to address issues that came up. The biggest challenge with this service was the sheer size of the change sets, containing three months of work for twenty or more developers. No one could possibly understand the scope of the changes, and that made every deployment risky in spite of a large and skilled QA team.

Compare that to my latest experience at Arctic Wolf, where we implemented Continuous Deployment and some services were re-deployed almost hourly with the latest changes. The system wasn’t perfect, and we definitely had bugs escape into production, but with a few exceptions we found it was almost always easier to fix the bug and deploy that change than to try to roll back a deployment.

What Continuous Deployment Gives You

I could write an entire book on the advantages of implementing Continuous Deployment, but for the reader’s sake let’s explore Risk, Quality, Cost, Speed, Happiness and Security.

Less Risk

The change management philosophy of a few decades ago, still pervasive in some older organizations and certainly in use at BlackBerry in the early 2000s, was that change is the enemy of stability. To maintain a service with five nines of uptime, changes had to be limited and carefully gated. Unfortunately that came into direct conflict with the Product and Dev teams’ need to ship new functionality and fix bugs. In my second example that conflict was in clear view: the Operations team, driven by strict SLAs, had a mandate to provide extremely stable services, while the Product and Dev teams had to ship new functionality. The result was infrequent but very large changes, and inevitable conflict between the two groups.

With Continuous Deployment, change sets are small and frequent, and new functionality is introduced in small, well-understood batches. When something does inevitably go wrong, the team can quickly narrow down the exact change that caused the issue. Smaller change sets reduce the risk of large outages.

Continuous Deployment carries an underlying assumption that your software can be deployed with zero downtime, whether through rolling updates or blue/green deployments. That in turn implies your infrastructure is defined with Infrastructure-as-Code and requires no manual intervention to deploy. This further reduces risk: deployments become automated, reproducible, and testable.
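
As an illustration, here is a minimal sketch of the rolling pattern. It is not tied to any platform; replace_instance() and health_check() are hypothetical stand-ins for whatever your infrastructure tooling provides.

    # Minimal sketch of a zero-downtime rolling deployment.
    # replace_instance() and health_check() are hypothetical stand-ins for
    # your platform's APIs (instance refresh, Kubernetes rollout, etc.).
    import time

    def replace_instance(instance: str, version: str) -> str:
        print(f"replacing {instance} with version {version}")
        return f"{instance}-{version}"

    def health_check(instance: str) -> bool:
        print(f"health checking {instance}")
        return True

    def rolling_deploy(instances: list[str], version: str, batch_size: int = 1) -> bool:
        for i in range(0, len(instances), batch_size):
            batch = instances[i:i + batch_size]
            new_instances = [replace_instance(inst, version) for inst in batch]
            time.sleep(1)  # give the new instances time to start serving traffic
            if not all(health_check(inst) for inst in new_instances):
                print("health check failed; stopping the rollout for investigation")
                return False
        print("rollout complete with zero downtime")
        return True

    if __name__ == "__main__":
        rolling_deploy(["web-1", "web-2", "web-3"], version="2.0.1")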

Higher Quality

A solid, dependable suite of automated tests is a key requirement for Continuous Deployment. If your teams are committed to keeping those tests up to date, regressions should be caught before they are deployed, resulting in higher quality software. I once overheard a team state “we don’t trust our tests”; they had reverted to manual testing and deployed once a week or less. Because they tested manually, they struggled with both velocity and quality. Once they had their test suite back under control, their velocity and quality came back up.
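
As a small illustration of what those tests look like, here is a hypothetical pricing function with the kind of pytest regression tests that gate a deployment; if either test fails, the change never ships.

    # A tiny example of the kind of regression test that gates a deployment.
    # apply_discount() is a hypothetical function standing in for real business logic.
    import pytest

    def apply_discount(price: float, percent: float) -> float:
        """Return the price after applying a percentage discount."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        return round(price * (1 - percent / 100), 2)

    def test_discount_is_applied():
        assert apply_discount(100.0, 25) == 75.0

    def test_invalid_discount_is_rejected():
        with pytest.raises(ValueError):
            apply_discount(100.0, 150)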

Lower Cost

Multiple studies show that bugs become exponentially more expensive to fix the further they make it from developer to customer. There is definitely an upfront cost to developing automated testing required for Continuous Deployment, but bugs will be caught earlier in the development cycle, resulting in lower overall cost in the long run.

Faster Time to Market

Compare my BlackBerry example, where it could take up to six months for a customer to see a new feature or bug fix, with the other two. In both of the other examples the teams could deliver new features far more quickly, in weeks or even hours.

One of the four DORA metrics (see Measuring Success below for more details) is “Lead Time for Changes”, the time from committing code to having it deployed in production. The lower that number is, the better, assuming the other metrics are also in line. An organization that can deploy a change to customers in hours is far more successful than one that takes six months.

Happier Teams

One of my favorite Slack reactions was a “#YOLO” icon that my colleagues would use to indicate they had reviewed and approved a GitHub pull request. It was used partially in jest, because everyone cared deeply about the services we were deploying, but it was also in part a nod to the CI/CD pipeline and testing infrastructure, and the trust we had in them. My stated goal as a leader was that a developer should be able to merge their code change and then step away for lunch, trusting that the CI/CD pipeline would build, test and deploy the change correctly, and if things went badly wrong would either roll back the change or page them via PagerDuty. That’s a lofty goal and I don’t think we ever got all the way there. In reality my colleagues and I typically tracked changes as they were rolled out, but I wanted us to have the confidence that the automation would just work.

Happy teams #yolo, by ChatGPT

Let’s be realistic — producing happy teams is not the reason your company exists. However, happy teams will be far more productive than unhappy ones, with less burnout and turnover, so it is in your best interest as a leader to make your teams happy. Giving teams the ability to deploy frequently without intervention means they can focus on delivering value to customers, and in their spare time they will create awesome Slack reactions.

More Secure

While all the other reasons above are important, this one is my favorite and not frequently talked about. Keeping software dependencies and operating systems up to date is hard. It can take a huge amount of time and can be very disruptive when a new patch for a critical bug is released. Continuous Deployment can remove a lot of that pain.

The Log4Shell logo. Source: Wikipedia.

When the Log4Shell vulnerability was announced, I led the response for patching Arctic Wolf’s systems. It was an incredibly stressful day, but over its course we patched thousands of AWS EC2 instances running hundreds of thousands of containers, all while keeping the services running so the customer-facing security teams could search the incoming data for Indicators of Compromise and protect customers. Let me restate that: the search load on the massive data store doubled that day, and we patched thousands of systems while handling that load. The only way we could accomplish that was with the Continuous Deployment pipeline we’d built; I can’t imagine an alternative scenario.

That’s an extreme example, but the same pattern applies to any new vulnerability: commit the new version to whatever dependency management system is in use, depend on the tests to catch any regressions, and depend on the automation to roll it out.
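
As a sketch of that pattern, assuming Python dependencies pinned in a requirements.txt file (the file name, package and versions below are hypothetical), the entire human step can be as small as rewriting one pin, opening a pull request, and letting CI/CD do the rest.

    # Minimal sketch: bump a pinned dependency to a patched version and let the
    # CI/CD pipeline build, test, and roll it out. The package name, versions,
    # and file path are hypothetical.
    from pathlib import Path

    def bump_pin(requirements: Path, package: str, patched_version: str) -> None:
        lines = requirements.read_text().splitlines()
        updated = []
        for line in lines:
            if line.strip().startswith(f"{package}=="):
                updated.append(f"{package}=={patched_version}")
            else:
                updated.append(line)
        requirements.write_text("\n".join(updated) + "\n")

    if __name__ == "__main__":
        # After this change is committed and reviewed, the pipeline takes over.
        bump_pin(Path("requirements.txt"), "examplelib", "2.17.1")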

Objections

Let’s address some objections to implementing Continuous Deployment.

Not Google

A common objection is “But we’re not Netflix or Google or ____, we don’t need that”. You do not need a large team to implement CI/CD. The entire R&D team at Arctic Wolf had fewer than a hundred developers when we moved from weekly releases to Continuous Deployment in 2018. If I were starting a new service or company today, the first thing I would do is implement CI/CD with a Hello World example, and then iterate from there. Even a small team of a few developers will see all of the benefits listed above.

Separation of Duties

In some regulated environments, auditors look for separation of duties between developers and staff who have access to production systems. I can’t speak to all types of certifications, but I had good success implementing Continuous Deployment while keeping SOC2 and ISO27001 certifications. The rationale we used was that a developer couldn’t make a change by themselves: every change was submitted as a GitHub Pull Request that required a mandatory review by someone else, and all changes to production systems followed the same flow. The CI/CD pipeline configuration was also stored in GitHub with the same workflow, deployed using CD, and couldn’t be modified by a single developer. That setup provided enough separation of duties to satisfy everyone.
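
As a rough illustration of the kind of evidence that satisfies this control, a check like the following, run over a hypothetical list of merged pull requests, verifies that no change reached production without a reviewer other than its author.

    # Sketch of a separation-of-duties check over merged pull requests.
    # The data structure is hypothetical; in practice it would come from the
    # GitHub API or an audit export.
    from dataclasses import dataclass

    @dataclass
    class MergedPullRequest:
        number: int
        author: str
        approvers: list[str]

    def violates_separation_of_duties(pr: MergedPullRequest) -> bool:
        # A change needs at least one approver who is not the author.
        independent_approvers = [a for a in pr.approvers if a != pr.author]
        return len(independent_approvers) == 0

    if __name__ == "__main__":
        history = [
            MergedPullRequest(101, "alice", ["bob"]),
            MergedPullRequest(102, "bob", []),          # no review: a violation
            MergedPullRequest(103, "carol", ["carol"]), # self-approval: a violation
        ]
        for pr in history:
            if violates_separation_of_duties(pr):
                print(f"PR #{pr.number} merged without an independent review")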

Badge of Honor

Occasionally, as in the situation in my introduction, a team or an individual wears the complexity of a deploy (or of builds or tests) as a badge of honor. I do understand the mindset: in the early part of my career I was a Unix SysAdmin, “twiddling bits” as a friend put it, and I was good at taking on large complex tasks by hand with artisanal, handcrafted servers. Taking that away from someone can be scary, and in some cases individuals use their knowledge of the complexity as misguided job security. A key point is to reassure them that their job security is not at stake with the change. How you approach this will depend heavily on the character of the teams. If someone refuses to change, they become a risk to the health of the organization, and you can then deal with it as a performance issue. In practice, if you start small by implementing Continuous Deployment with one service and expanding from there, this objection will usually quietly go away by itself.

Me, hand crafting artisanal servers, by ChatGPT

Not a SaaS

When the software you deliver is not a typical SaaS but packaged software or an appliance, deploying continuously may not be possible. In my experience, however, Continuous Delivery is still possible and desirable. In this scenario you should still follow CI/CD practices, building and testing the software continually, but the actual release is gated by a business process rather than triggered by the dev team. The mechanism may differ, but the end result is the same: the product team enables a new feature for a customer. In the SaaS world this is usually implemented by enabling a feature flag in software that’s already been deployed; for an app it may be publishing a new version in the App Store; and for more traditional packaged software it’s releasing a new version on a website. The same principles of Continuous Integration and Continuous Delivery apply in all of these situations.
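
To illustrate the feature flag approach in the SaaS case, here is a minimal sketch; the in-memory dictionary is a hypothetical stand-in for a real flag service or configuration system, and the flag and customer names are made up.

    # Minimal sketch of a feature flag: the code ships continuously, but the
    # feature is enabled per customer by the product team, not by a deploy.
    FEATURE_FLAGS = {
        "new-dashboard": {"enabled_for": {"customer-a", "customer-b"}},
    }

    def is_enabled(flag: str, customer_id: str) -> bool:
        flag_config = FEATURE_FLAGS.get(flag, {})
        return customer_id in flag_config.get("enabled_for", set())

    def render_dashboard(customer_id: str) -> str:
        if is_enabled("new-dashboard", customer_id):
            return "rendering the new dashboard"
        return "rendering the old dashboard"

    if __name__ == "__main__":
        print(render_dashboard("customer-a"))  # new dashboard
        print(render_dashboard("customer-z"))  # old dashboard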

Measuring Success

Google’s DevOps Research and Assessment (DORA) team has put together four commonly accepted metrics to measure software delivery performance. Measuring these, even anecdotally at first, before you implement CD means you have a baseline to measure against during the implementation. The full “State of DevOps 2022” report with the metrics is posted here, and the metrics are (a small computation sketch follows the list):

  1. Lead time for changes: the time from code commit to release in production.
  2. Deployment frequency: how often an organization deploys code to production.
  3. Time to restore service: the average time it takes to restore service when an incident or a defect impacts users.
  4. Change failure rate: the percentage of changes to production that result in degraded service and subsequently require remediation.
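
Even a small script over your deployment records can produce these numbers as a starting point. The sketch below assumes a hypothetical record format with commit and deploy timestamps and a failure flag; in practice the data would come from your CI/CD system, incident tracker, or git history.

    # Sketch: computing the four DORA metrics from deployment records.
    # The record format is hypothetical.
    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Deployment:
        committed_at: datetime
        deployed_at: datetime
        caused_failure: bool
        time_to_restore: timedelta | None = None

    def dora_metrics(deploys: list[Deployment], period_days: int) -> dict:
        lead_times = [d.deployed_at - d.committed_at for d in deploys]
        failures = [d for d in deploys if d.caused_failure]
        restores = [d.time_to_restore for d in failures if d.time_to_restore]
        return {
            "lead_time_for_changes": sum(lead_times, timedelta()) / len(lead_times),
            "deployment_frequency_per_day": len(deploys) / period_days,
            "time_to_restore_service": (
                sum(restores, timedelta()) / len(restores) if restores else None
            ),
            "change_failure_rate": len(failures) / len(deploys),
        }

    if __name__ == "__main__":
        now = datetime(2024, 1, 31, 12, 0)
        deploys = [
            Deployment(now - timedelta(hours=5), now, caused_failure=False),
            Deployment(now - timedelta(hours=2), now, caused_failure=True,
                       time_to_restore=timedelta(minutes=30)),
        ]
        print(dora_metrics(deploys, period_days=30))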

Several companies are building services to measure these metrics, and they are worth looking at; CodeGem and JellyFish are two that stand out, though I have no affiliation with either.

Next Steps

If you’ve made it this far, there are lots of great talks and articles on how to implement Continuous Deployment with different technologies, and a plethora of vendors selling great products to help. I recommend first tracking the four metrics listed above, and then starting with a small service that’s not critical to the line of business. Once you have a service or two deploying continually, the benefits will become clear to your organization.

Further Reading