Crew resource management on incident bridges

Crew resource management in aviation is about using the whole crew’s brain without talking over each other. Incident bridges often do the opposite. I say that as someone who has both been the person talking too much and the person silent when I should have spoken up. CRM is not a corporate training module I am trying to sell you. It is a label for habits that keep groups of stressed humans from making predictable mistakes.

This post is about roles, language, and small structures that help on bridge calls when Kubernetes, applications, networks, and organizational politics all fail at once. None of this replaces technical skill. It just stops technical skill from canceling itself out in a noisy Zoom room.

Why bridges go wrong so quickly

Incidents compress time. Information arrives incomplete. Everyone has a theory. Seniority and volume get mistaken for correctness. Tools lag reality. Customers ask for updates before you know what broke. In that environment, the default meeting format — everyone unmuted, no explicit lead — is a trap.

Aviation learned this the hard way, over decades and with investigations that are uncomfortable to read. Cockpits moved from a captain-is-always-right culture to explicit challenge and cross-check. DevOps war rooms do not face the same regulatory pressure, but the cognitive problem is similar and so is the remedy: fixed roles, a shared mental model, clear authority for decisions, and the psychological safety to say “I do not know yet.”

I am not asking you to call your incident commander “Captain.” I am asking you to decide who is driving before the car starts sliding.

Roles that actually help

These are the roles I have seen work on bridges that ended in a reasonable time with a written timeline. Attaching names matters. “We have a scribe” is weaker than “Jamie is scribe.”

Incident lead — one person coordinates. They do not fix everything alone. Their job is priority, sequencing, and decisions when the room disagrees. They ask “what do we know,” “what are we trying next,” and “who owns that action.” They protect the team from thrashing. If the lead is also deep-diving in logs for twenty minutes straight, pick a different lead or delegate the dive.

Scribe — timeline, commands run, hypotheses ruled out. This sounds optional until you need to explain to leadership what happened six hours later and nobody agrees on the order of events. The scribe writes in the incident channel, not a private notebook. Timestamps matter. “We restarted the API” is less useful than “14:32 restarted deployment checkout-api, no change in error rate.”

Comms — updates to stakeholders on a schedule, not ad hoc panic. Product, support, executives — they need predictable cadence even when the answer is “still investigating.” Comms filters noise so the lead is not answering Slack DMs during diagnosis. Comms also pushes back on premature promises: “ETA 15 minutes” when you have not found root cause yet.

Subject experts — app, platform, network, database — called in when needed, not all talking at once. Experts contribute facts and options. The lead chooses. An expert who disagrees with a decision should say so clearly, once, with evidence — then the lead decides and the room commits to trying the chosen path long enough to learn something.

Without names attached, everyone “helps” by suggesting restarts. I have watched three engineers run the same rollout undo because nobody heard that the first one had succeeded.

Phrases that reduce chaos

Language matters under stress. Pilots use standard callouts so nobody invents new words when the altimeter shows something unwelcome. Your team can borrow that idea without sounding like you are LARPing an airline.

“I do not know yet” — permission to not fake certainty. Guessing on a bridge spreads faster than facts. Admitting gaps invites someone who does know to speak.

“Let’s test that hypothesis” — moves from debate to evidence. Instead of arguing whether it is DNS, run one command and read the output together.

“Hold changes” — freeze deploys, config pushes, and infrastructure edits until you understand blast radius. Continuing to change a broken system is how you lose the ability to roll back cleanly.

“Read back” — repeat the action before someone runs it. “Scaling redis to zero” deserves a pause.

“Time check” — how long have we been trying this path without improvement? Bridges drift. A five-minute nudge prevents an hour of sunk cost.

“Adding a role” — explicit and neutral. “I need a network person” beats vague “can someone look at the network.”

I still catch myself stating theories as facts when I am tired. These phrases are corrections I wish someone had used on me earlier in my career.

A two-minute opening sync that scales

Start every significant bridge with the same structure. Boring on call three, useful on call thirty.

Symptom — what users or alerts see
Impact — who is affected, severity, SLO status
Last change — deploy, config, infra, traffic shift
Current theory — one sentence, labeled as theory
Next check — one action with an owner
Roles — lead, scribe, comms confirmed

Two minutes. Same headings in the incident doc template. People join late and catch up from the channel instead of asking “what are we doing” every four minutes.
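
If your team posts the sync from a bot or script rather than typing it by hand, the six headings above map onto a small record. This is a minimal Python sketch, not a real tool: the class name, field names, people, and sample values are all hypothetical. It shows two things, a check for headings nobody has filled in yet and a plain-text rendering that always labels the theory as unconfirmed.

```python
from dataclasses import dataclass


@dataclass
class OpeningSync:
    """The six headings of the two-minute opening sync (hypothetical schema)."""
    symptom: str
    impact: str
    last_change: str
    current_theory: str      # one sentence, always rendered as unconfirmed
    next_check: str
    next_check_owner: str
    lead: str
    scribe: str
    comms: str

    def missing(self) -> list[str]:
        """Return the names of fields that are still blank, in declaration order."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def render(self) -> str:
        """Format the sync as plain text for the incident channel."""
        return "\n".join([
            f"Symptom: {self.symptom}",
            f"Impact: {self.impact}",
            f"Last change: {self.last_change}",
            f"Current theory (unconfirmed): {self.current_theory}",
            f"Next check: {self.next_check} (owner: {self.next_check_owner})",
            f"Roles: lead {self.lead}, scribe {self.scribe}, comms {self.comms}",
        ])


# Illustrative values only.
sync = OpeningSync(
    symptom="checkout-api returning 5xx",
    impact="checkout failing for some users, severity not yet set",
    last_change="deploy of checkout-api earlier today",
    current_theory="bad config in the latest deploy",
    next_check="compare config between the last two releases",
    next_check_owner="Sam",
    lead="Alex",
    scribe="Jamie",
    comms="",  # nobody has confirmed the comms role yet
)
print(sync.missing())
print(sync.render())
```

The `missing()` check is the part I would actually lean on: it turns “did we assign comms?” into something the lead can read off instead of remember.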

Aviation briefings work similarly: weather, route, fuel, threats, division of duties. Not because pilots forget how to fly, but because shared context reduces duplicate work.

Decision authority and disagree-and-commit

The incident lead must have authority to say “we are doing X now.” Not the authority to always be right, but the authority to pick a path so the group stops circling. Organizations that insist on consensus during outages often get paralysis or parallel experiments.

Healthy CRM includes challenge. If the lead chooses rollback and the app owner believes the database migration makes rollback unsafe, that conflict must surface before the command runs. After discussion, the lead decides. The room commits unless new evidence arrives. Running rollback and manual schema repair simultaneously because two engineers disagreed quietly is a failure mode I have seen more than once.

Document dissent in the timeline if it mattered. Post-incident reviews are kinder when the record shows “we chose A knowing B warned about C.”

Observers, executives, and the mute button

Bridges fail socially when too many people listen without roles. Observers are fine. Observers who redirect the investigation because they have a hunch are expensive. I prefer a single comms path for leadership questions so the lead is not performing status updates while trying to think.

Executives often want confidence. Engineers often want accuracy. Comms translates between them: “We do not have root cause. We have stopped customer impact by scaling read replicas. Next update in 30 minutes.” That is honest without dumping raw uncertainty on a customer-facing support team.
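
A fixed cadence is easier to keep when every update carries the time of the next one. Here is a small Python sketch of that idea. The function name, parameters, and the 30-minute default are my assumptions for illustration, not a standard. It takes what is known, what has been mitigated, and when the update is sent, and appends the next update time in UTC.

```python
from datetime import datetime, timedelta, timezone


def stakeholder_update(known: str, mitigation: str, sent_at: datetime,
                       cadence_minutes: int = 30) -> str:
    """Build a stakeholder update that commits to the next update time."""
    if sent_at.tzinfo is None:
        # A naive timestamp would be silently read as local time; refuse it.
        raise ValueError("sent_at must be timezone-aware")
    sent_utc = sent_at.astimezone(timezone.utc)
    next_utc = sent_utc + timedelta(minutes=cadence_minutes)
    return (
        f"[{sent_utc:%H:%M} UTC] {known} {mitigation} "
        f"Next update by {next_utc:%H:%M} UTC."
    )


message = stakeholder_update(
    "We do not have root cause.",
    "We have stopped customer impact by scaling read replicas.",
    datetime(2024, 1, 1, 14, 45, tzinfo=timezone.utc),  # illustrative time
)
print(message)
```

Stating a clock time rather than “in 30 minutes” means a stakeholder who reads the message late still knows when to expect the next one.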

Technical contributors should default to muted unless addressing the lead or answering a direct question. Unmuted bridges with twelve people sound like radio interference. I have been the interference.

When it falls apart

Too many observers, no decision maker, or the loudest senior engineer becomes the de facto lead without accepting the role — those are the patterns I recognize in bridges that lasted hours because nobody wrote down what we already tried.

Another failure mode: hero debugging. One strong engineer goes dark in a shell for forty-five minutes while the room waits. Sometimes that person saves the day. Often the room lost parallel investigation time. The lead should pull status every ten minutes: “Need more time” is valid; silence is not.
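
The ten-minute status pull can be as simple as tracking when each person last reported. A hedged Python sketch follows. The function name, the dictionary shape, and the people in it are hypothetical, and the ten-minute default is only the interval suggested above.

```python
from datetime import datetime, timedelta, timezone


def overdue_for_status(last_status: dict[str, datetime], now: datetime,
                       interval_minutes: int = 10) -> list[str]:
    """Return, in alphabetical order, who has been silent for at least the interval."""
    limit = timedelta(minutes=interval_minutes)
    return sorted(name for name, seen in last_status.items() if now - seen >= limit)


# Illustrative data: when each investigator last reported to the lead.
now = datetime(2024, 1, 1, 15, 0, tzinfo=timezone.utc)
last_status = {
    "Sam": now - timedelta(minutes=12),
    "Alex": now - timedelta(minutes=3),
    "Riya": now - timedelta(minutes=10),
}
print(overdue_for_status(last_status, now))
```

The output is a list of names to ask, not a verdict. “Need more time” resets the clock just as well as a finding does.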

Fatigue is real on long incidents. Rotate the lead and scribe. I have made bad calls after hour six because nobody suggested a handoff. Aviation limits crew duty time for good reason. Knowledge work has softer limits but a similar fatigue curve.

Blame during the incident destroys CRM faster than any tool outage. Save accountability for the post-incident review. During the bridge, focus on restoration and evidence.

Scribe quality: what to write down

A useful timeline includes:

When the incident was declared and severity level
Alerts that fired and when they cleared
Customer reports correlated (or not) with metrics
Commands and config changes with who ran them
Hypotheses explicitly ruled out
External dependencies checked (cloud status, payment provider, CDN)
Decision points and who authorized them

Avoid pasting walls of stack traces into the main timeline; link to them instead. The timeline is for sequencing decisions, not storing logs.
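
To keep entries consistent, it can help to treat each timeline line as a small record: who did what, when, what happened, and a link instead of a pasted log. The Python sketch below is one possible shape, not a prescribed format. The class name, fields, and output layout are assumptions. It normalizes timestamps to UTC and refuses naive ones.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEntry:
    """One line in the incident timeline (hypothetical schema)."""
    at: datetime        # must be timezone-aware
    actor: str          # who ran the command or made the decision
    action: str         # what was done or decided
    result: str = ""    # observed outcome, including "no change"
    link: str = ""      # pointer to logs or traces, instead of pasting them

    def render(self) -> str:
        """Format the entry as a single UTC-stamped line."""
        if self.at.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")
        stamp = self.at.astimezone(timezone.utc).strftime("%H:%M")
        line = f"{stamp} UTC {self.actor}: {self.action}"
        if self.result:
            line += f", {self.result}"
        if self.link:
            line += f" (details: {self.link})"
        return line


# Illustrative entry, recorded by someone in a UTC+2 timezone.
entry = TimelineEntry(
    at=datetime(2024, 5, 1, 16, 32, tzinfo=timezone(timedelta(hours=2))),
    actor="Jamie",
    action="restarted deployment checkout-api",
    result="no change in error rate",
)
print(entry.render())
```

Recording the result alongside the action is what lets a later reader see that a hypothesis was ruled out rather than merely tried.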

After resolution, the scribe’s doc becomes the seed for the post-incident review. Runbooks get updated from this material. Skimping on scribe work means you relearn the same bridge dysfunction next quarter.

OpenShift and platform bridges

Platform incidents add CRM complexity because application teams and cluster teams share a bridge but bring different mental models. Application engineers think in deployments and HTTP errors. Platform engineers think in nodes, operators, etcd, ingress controllers, and quota.

Explicitly name a platform lead and an application lead when both worlds touch. Have each report status in their vocabulary, then translate for the room. “Ingress controller pod restarted” means something different to each side until someone states user impact.

For Operator-managed services, agree before making manual edits: will Flux or OLM revert human changes? Who pauses reconciliation? Platform bridges without that agreement produce fixes that flap as the controller undoes them.

Training without pretending you are an airline

You do not need CRM posters in the break room. You need one practice session where the team runs a fake incident with roles assigned. Game days expose who naturally leads, who writes well under pressure, and whose executive updates need an editor first.

Debrief the game day like a flight debrief: what worked, what confused us, what we will change in the template. Small improvements compound — a shared doc heading, a default Zoom role, a paging policy that pages one lead first instead of six people simultaneously.

I am still awkward assigning myself incident lead. I default to fixing. Learning to coordinate instead of typing is ongoing work.

Personal habits I am trying to keep

When I join someone else’s bridge, I state my name and expertise once, then stay quiet until the lead asks or I have information that changes the next check. I do not litigate architecture during restoration.

When I lead, I say out loud when I am unsure. I assign scribe early — before we need them. I set comms cadence even if stakeholders are not loud yet.

When I scribe, I ask people to repeat commands before they run them. I timestamp in UTC. I note when we stopped trying a path, not only when we started.

These habits fail sometimes. I still talk over people when adrenaline spikes. Apologize, reset, continue. CRM is practice, not personality.

Closing thought

The best runbook in the world does not help if twelve people execute twelve versions of it at once. CRM is how a team applies collective skill without collective noise. Aviation formalized that because the cost of failure was visible. Our incidents are usually softer than crashes — but the hours lost, the customer trust spent, and the tired engineers afterward are real costs.

You do not need a perfect culture to start. Name a lead, name a scribe, open with a two-minute sync, and hold changes until you know more. That alone puts you ahead of most bridges I have joined. I am still learning to be the participant I describe here. Writing it down is partly a reminder for me, partly an offer to you if it helps your next long night.