Aborting a deploy is a go-around, not a failure

Every pilot practices go-arounds. You break off the landing, climb back to a safe altitude, fly the pattern again, and land when the approach is stable. Passengers sometimes think something terrible happened. Instructors think you did the right thing when the runway was occupied, the windshear alert fired, or the approach just felt wrong.

Deploys need the same cultural permission. Stopping or rolling back a release is not a moral failure. It is a control input when evidence says the approach is unstable.

Why we hesitate

Teams celebrate shipping. Metrics reward frequency. Rollback buttons exist but feel like they come with a side of shame. I have felt it myself: we already announced this, the ticket is “done,” the demo is tomorrow.

Aviation has an advantage here: go-arounds are graded, not gossiped. You debrief facts. In software, aborting a deploy often turns into a story about who panicked, even when the system was clearly misbehaving.

I do not have a perfect fix for culture. I do have habits that make the technical part boring enough that the social part shrinks.

What “unstable approach” looks like in production

Not every blip means stop. That is how you get noisy rollbacks. I look for patterns that match things we said we cared about before the deploy started:

Error rate or latency crossing agreed thresholds. If we wrote down SLOs in the change ticket, use them. If we did not write anything down, we are guessing under stress, which is when guesses get loud.

Health checks failing on new replicas while old ones are fine. Kubernetes will keep trying. That is not kindness; it is persistence. A Deployment stuck progressing with CrashLoopBackOff on the new ReplicaSet is a flashing “go-around” light.

Customer-visible symptoms tied to the change window. Support tickets, payment failures, empty carts. Correlation is not causation, but correlation right after you pushed image tag v2.3.1 deserves respect.

Dependency alarms you did not expect. Sometimes the app is fine and the database migration is not. Sometimes the ingress change broke session affinity. The deploy is one lever; the blast radius is wider.

Gut feel from someone who knows the system. I do not worship intuition, but I listen when the on-call engineer who has seen this service for three years says, “This smell is wrong.” We can test that smell with one more check before we commit to full rollback.

If none of those are present, hold steady, watch, and avoid hero edits. If several are present, stopping is cheaper than explaining.
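A minimal sketch of that rule in Python, assuming the signals have already been collected into simple values. The names `Thresholds`, `Observation`, and `assess`, the default numbers, and the middle "check" outcome for a single signal are illustrative assumptions, not a standard.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Limits written in the change ticket. The defaults are placeholders."""
    max_error_rate: float = 0.02        # fraction of failed requests
    max_p95_latency_ms: float = 800.0


@dataclass(frozen=True)
class Observation:
    """What we see during the change window."""
    error_rate: float
    p95_latency_ms: float
    new_replicas_failing: bool          # e.g. CrashLoopBackOff on the new ReplicaSet
    customer_reports: int               # tickets tied to the change window
    unexpected_dependency_alarms: int
    on_call_concerned: bool             # "this smell is wrong"


def triggered_signals(obs: Observation, limits: Thresholds) -> list[str]:
    """Return the names of the signals that are currently present."""
    signals = []
    if obs.error_rate > limits.max_error_rate:
        signals.append("error rate over threshold")
    if obs.p95_latency_ms > limits.max_p95_latency_ms:
        signals.append("latency over threshold")
    if obs.new_replicas_failing:
        signals.append("new replicas failing health checks")
    if obs.customer_reports > 0:
        signals.append("customer-visible symptoms")
    if obs.unexpected_dependency_alarms > 0:
        signals.append("unexpected dependency alarms")
    if obs.on_call_concerned:
        signals.append("on-call concern")
    return signals


def assess(obs: Observation, limits: Thresholds) -> str:
    """No signals: hold. One signal: run one more check. Several: go around."""
    count = len(triggered_signals(obs, limits))
    if count == 0:
        return "hold"
    if count == 1:
        return "check"  # assumption: a single signal earns one more look
    return "go-around"
```

The value of writing it down is less the function than the arguments: someone has to fill in `Thresholds` before the deploy starts.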

Technical go-arounds in Kubernetes

The mechanics depend on how you ship. The philosophy is the same: return to a known-good state without improvising twelve changes at once.

Rolling Deployment stuck mid-rollout. kubectl rollout undo deployment/<name> is the cleanest path back to the previous ReplicaSet, as long as the Deployment's revision history still holds it. I verify with kubectl rollout status deployment/<name> and check that the Pods from the restored ReplicaSet are Ready before I tell anyone we are safe.

Helm release that should not have left the barn. helm rollback <release> <revision> if chart history is reliable. I have seen teams skip revision discipline and then discover rollback points that never existed. That is a process problem wearing a technical mask.

GitOps drift. If Argo CD or Flux synced a bad commit, revert the commit and let the controller reconcile. Fighting the cluster by hand while GitOps keeps re-applying the bad desired state is a special kind of tired.

Feature flags versus image deploys. If the change was flag-only, turn the flag off. Faster than rebuilding the world. I still record it in the incident timeline because “we toggled it” without documentation becomes mythology.

Database migrations. This is the hard lane. Some migrations are forward-only. A go-around might mean application rollback while schema stays new, which requires compatibility you planned for or pain you accept. I do not pretend every deploy can be undone in one button press. I do say we should know which category we are in before we push, not after.

Blue-green or canary. Abort often means “send traffic back” or “promote the stable color.” The win is you kept the old lane warm. The loss is when nobody tested abort paths and the green environment was already torn down.

Whatever the tool, I say out loud what I am doing: “Rolling back deployment X to revision Y.” Bridges get quieter when commands are shared, not secret.
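One way to make "say it out loud" a habit is to generate the announcement and the commands together, so nothing runs without being posted first. The sketch below only builds and prints them; it does not execute anything. The function names, the 120-second timeout, and the example names `web`, `shop`, and `checkout` are hypothetical.

```python
from __future__ import annotations

import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RollbackPlan:
    announcement: str           # the sentence to post on the bridge
    commands: list[list[str]]   # argv lists, in the order they should run


def plan_deployment_rollback(
    name: str, namespace: str, revision: Optional[int] = None
) -> RollbackPlan:
    """Plan a kubectl rollback and the status check that confirms it."""
    target = f"deployment/{name}"
    undo = ["kubectl", "rollout", "undo", target, "-n", namespace]
    if revision is not None:
        undo.append(f"--to-revision={revision}")
    status = ["kubectl", "rollout", "status", target, "-n", namespace,
              "--timeout=120s"]
    where = f"revision {revision}" if revision is not None else "the previous revision"
    return RollbackPlan(
        announcement=f"Rolling back {target} in {namespace} to {where}.",
        commands=[undo, status],
    )


def plan_helm_rollback(release: str, namespace: str, revision: int) -> RollbackPlan:
    """Plan a Helm rollback to an explicit revision."""
    cmd = ["helm", "rollback", release, str(revision), "-n", namespace, "--wait"]
    return RollbackPlan(
        announcement=f"Rolling back Helm release {release} in {namespace} "
                     f"to revision {revision}.",
        commands=[cmd],
    )


def render(plan: RollbackPlan) -> str:
    """Format the plan for pasting into the incident channel."""
    lines = [plan.announcement]
    lines += ["$ " + shlex.join(cmd) for cmd in plan.commands]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(plan_deployment_rollback("web", "shop", revision=4)))
    print(render(plan_helm_rollback("checkout", "shop", 12)))
```

Keeping the commands as argv lists rather than strings means the same plan can later be handed to `subprocess.run` without re-parsing, if the team decides to automate the step.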

The brief before you need the maneuver

Go-arounds work because pilots briefed them before takeoff. For deploys, my minimum brief sounds almost too small to matter:

What is changing, in one sentence a human can repeat.

What signals we watch, with numbers if possible.

Who can call stop, without a committee.

What rollback looks like, including who runs it and how we know it worked.

What we will not fix during the first ten minutes (scope control).

I write this in the change ticket even when nobody reads it until something wobbles. Especially then.
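If the ticket system allows structured fields, the brief can be checked for gaps before the deploy starts. This is a sketch under that assumption; `DeployBrief` and its field names are made up for illustration and map one-to-one onto the five items above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DeployBrief:
    change_summary: str = ""                                  # one repeatable sentence
    watch_signals: list[str] = field(default_factory=list)    # with numbers if possible
    stop_callers: list[str] = field(default_factory=list)     # who can call stop
    rollback_steps: list[str] = field(default_factory=list)
    rollback_owner: str = ""                                  # who runs the rollback
    rollback_success_check: str = ""                          # how we know it worked
    deferred: list[str] = field(default_factory=list)         # not fixed in first ten minutes

    def gaps(self) -> list[str]:
        """Return the brief items that are still empty."""
        checks = {
            "what is changing": bool(self.change_summary.strip()),
            "signals to watch": bool(self.watch_signals),
            "who can call stop": bool(self.stop_callers),
            "rollback steps": bool(self.rollback_steps),
            "who runs rollback": bool(self.rollback_owner.strip()),
            "how we know rollback worked": bool(self.rollback_success_check.strip()),
            "what we will not fix in the first ten minutes": bool(self.deferred),
        }
        return [item for item, present in checks.items() if not present]


if __name__ == "__main__":
    brief = DeployBrief(change_summary="Move checkout to the new tax service.")
    for gap in brief.gaps():
        print("missing:", gap)
```

An empty `gaps()` result does not mean the brief is good, only that nobody skipped a line.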

During the abort: discipline over drama

When we decide to go around, I try to behave like the checklist is boss:

Stop the rollout or revert using the prepared path. Resist patching random environment variables “just to see.”

Freeze other changes. A deploy abort plus a secret rotation plus a cache flush is three variables. You will not know which one helped.

Keep a scribe. Timestamp, command, observation. Future you is tired you.

Communicate on a schedule. “We rolled back, error rate falling, investigating root cause, next update in fifteen minutes” beats silence or a flood.

Do not assign blame while metrics are still moving. Blame can come later with data. Rarely does it need to come in the first five minutes.

I have broken every one of these rules under pressure. The rules still help when I remember them.
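The scribe role is the easiest of these to support with a few lines of code. The sketch below keeps an in-memory timeline of timestamp, command, and observation; the `Timeline` class and its output format are assumptions, and a shared document works just as well.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class Entry:
    at: datetime
    command: str
    observation: str


class Timeline:
    """Append-only incident log: timestamp, command, observation."""

    def __init__(self, clock: Callable[[], datetime] | None = None) -> None:
        # The clock is injectable so the log can be tested with fixed times.
        self._clock = clock or (lambda: datetime.now(timezone.utc))
        self._entries: list[Entry] = []

    def note(self, command: str, observation: str) -> Entry:
        """Record what was run and what was seen, stamped in UTC."""
        entry = Entry(self._clock(), command, observation)
        self._entries.append(entry)
        return entry

    def render(self) -> str:
        """One line per entry, oldest first."""
        return "\n".join(
            f"{e.at:%H:%M:%SZ}  {e.command}  ->  {e.observation}"
            for e in self._entries
        )


if __name__ == "__main__":
    log = Timeline()
    log.note("kubectl rollout undo deployment/web", "rollback started")
    log.note("kubectl rollout status deployment/web", "rollout complete")
    print(log.render())
```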

After: debrief without theatre

A good go-around debrief is short and kind:

What did we see that triggered the decision?

Was the trigger correct?

Did rollback work as practiced?

What would have made abort faster or safer?

What do we change before the next attempt?

I am not looking for a root-cause saga on the bridge call. I want one or two corrective actions with owners. Maybe the readiness probe was lying. Maybe the new dependency timeout was too aggressive. Maybe we needed a canary and shipped straight to everyone because we were in a hurry.

Passengers remember smooth landings after a go-around. Users remember honesty and stability. “We reverted the change and service recovered” is a complete sentence many teams undersell.

When abort is the wrong call

Rollback is not free. You can introduce inconsistency, strand partial migrations, or hide a problem that will return on the next push. Sometimes the right move is a forward fix with a feature flag or a hot patch you tested in staging.

I abort when the customer impact is clear and growing, when rollback is rehearsed or low risk, or when we cannot understand the failure mode fast enough to bet forward safely.

I hold and diagnose when metrics are noisy but flat, when rollback could cause more damage than waiting, or when the issue is clearly unrelated and we can prove it quickly.

Getting that wrong is part of the job. The goal is to make wrongness reversible and visible, not heroic.

Personal note

Flying taught me that the approach can feel fine until it is not. DevOps taught me that graphs can look fine until someone zooms in. I still hate pressing rollback. I hate explaining outages to customers more.

If your team treats abort as embarrassment, pick one deploy this month and walk through the undo steps in staging. Name who can call stop. Practice once so the real thing feels like training, not improvisation.

A go-around is not a failed flight. It is a flight that refused to become an accident. Deploys deserve the same dignity.