Alert fatigue and the discipline of cockpit radio

The first time I flew into a busy Class B airport as a student, the radio was incomprehensible. Controllers talking fast. Pilots answering in shorthand. Calls overlapping. My instructor let me flail for one approach, then said something I’ve never forgotten: “Listen before you transmit. Say what matters. Stop when you’re done.”

[Image: air traffic control radar screens. Photo by Tom Fisk on Pexels.]

Monitoring systems don’t use 121.5 MHz, but they produce the same cognitive weather when nobody enforces discipline. Pages at 3 a.m. Slack channels that never idle. Dashboards where every panel is red because the threshold is wrong. On-call engineers who mute notifications not because they’re lazy, but because the signal drowned years ago.

I don’t have a perfect alerting setup. I’m not going to sell you a vendor or a one-size template. What I can offer is a comparison that keeps helping me prune noise: alert fatigue is what happens when your monitoring behaves like a crowded frequency with no radio discipline.

A frequency with no rules

Aviation radio works because the rules are narrow and repeated:

Who you’re calling — “Boston Approach, Cessna 123AB”
What you need — “requesting vectors ILS 4R”
Essential readbacks — altitudes, headings, clearances
Silence when listening — you don’t step on other transmissions

Break those rules and you get blocked, corrected, or ignored. The system assumes congestion and designs for clarity under load.

Monitoring stacks often do the opposite:

Alerts fire with no clear owner (“something in the cluster”)
The same condition pages five channels because three teams copied the same Prometheus rule
Warnings and criticals differ only in emoji
Auto-resolve is missing, so old fires smolder in the UI
Every deploy triggers a brief CPU blip that wakes three people

The on-call engineer becomes the controller trying to separate traffic with half the calls missing a callsign. Except controllers get training. We get a PagerDuty schedule and a hope.

I’m not blaming tools. I’ve misconfigured all of them. The discipline problem is human and organizational before it’s technical.

Warnings vs emergencies: not every blip is Mayday

In flying, urgency has vocabulary. Pan-Pan for urgent but not immediate distress. Mayday for immediate threat to life. Controllers respond differently; pilots don’t use Mayday for a rough mag check on the ground.

In ops, we flatten severity until everything is CRITICAL or nothing is:

Aviation habit	Monitoring equivalent
Mayday	Customer-visible outage, data loss in progress, active security breach
Pan-Pan	Degraded but serving; SLO burn; single AZ loss with failover working
Routine call	Ticket next business day; capacity trending bad in two weeks
Guard frequency chatter	Log noise; non-actionable debug events

When your platform pages on-call for Pan-Pan and guard chatter, fatigue isn’t a mystery. It’s policy.

We spent a year slowly re-labeling alerts at one job. Boring work. The rule we tried to enforce: if the recipient can’t take a concrete first action within five minutes, it’s not a page. It might still be a ticket, a dashboard annotation, or a daily digest. It shouldn’t vibrate a phone.
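Here is roughly how that rule can look as an Alertmanager routing tree. Treat it as a sketch rather than our production config: the `severity` values (`page`, `ticket`), the receiver names, the webhook URL, and the email address are placeholders I made up for illustration.

```yaml
# alertmanager.yml (fragment). Assumes every alert rule sets a `severity` label.
route:
  # Default receiver: anything not explicitly marked lands in low-urgency email.
  receiver: team-email
  routes:
    # Only alerts with a concrete first action should carry severity=page.
    - matchers:
        - 'severity = "page"'
      receiver: oncall-pager
    # Actionable, but fine to handle the next business day.
    - matchers:
        - 'severity = "ticket"'
      receiver: team-tickets

receivers:
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: "REPLACE_WITH_INTEGRATION_KEY"   # placeholder
  - name: team-tickets
    webhook_configs:
      - url: "https://tickets.example/alert-hook"     # hypothetical endpoint
  - name: team-email
    email_configs:
      - to: "team@example.com"                        # placeholder; needs global SMTP settings
```

The design choice worth copying is the quiet default: an alert nobody bothered to classify ends up in email, not on a phone.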

Kubernetes makes this harder because the same symptom means different things. CrashLoopBackOff on a batch worker overnight might be a cron job with bad input. On the payment service it’s Mayday. Context belongs in the alert — namespace, workload, runbook link — like a callsign. “CrashLoopBackOff” alone is “aircraft in distress somewhere in New England.”
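A minimal sketch of the same symptom carrying different context, assuming kube-state-metrics is installed (it exposes `kube_pod_container_status_waiting_reason`). The namespaces `payments` and `batch`, the team labels, the `for` durations, and the runbook URL are hypothetical.

```yaml
groups:
  - name: crashloop-by-context
    rules:
      # Payment service: crash-looping for 5 minutes is a page.
      - alert: PaymentsCrashLoopBackOff
        expr: |
          max by (namespace, pod, container) (
            kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="payments"}
          ) == 1
        for: 5m
        labels:
          severity: page
          team: payments
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in CrashLoopBackOff"
          runbook_url: "https://wiki.example/runbooks/payments-crashloop"  # hypothetical

      # Batch workers: same symptom, longer fuse, ticket instead of page.
      - alert: BatchCrashLoopBackOff
        expr: |
          max by (namespace, pod, container) (
            kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="batch"}
          ) == 1
        for: 30m
        labels:
          severity: ticket
          team: data-platform
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in CrashLoopBackOff"
```

The namespace, pod, and container in the summary are the callsign; the `severity` label is the Mayday versus routine-call distinction.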

Listen before you transmit

My instructor’s “listen before you transmit” maps to “observe before you alert.”

Good controllers build a picture before issuing vectors. Good alerting pipelines ask: is this still true? is it novel? is someone already working it?

Practices that reduced our noise without hiding real fires:

Grouping and inhibition. Multiple pods failing the same probe during a node drain shouldn’t page once per pod. One alert, node name in the label, inhibition while the drain annotation is present. We missed real node failures the first time we got this wrong — tuning took weeks.
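As a sketch of the drain case in Alertmanager terms: the alert names `NodeDraining` and `PodProbeFailing` are hypothetical, and this only works if both alerts carry matching `cluster` and `node` labels, which is an assumption about how the rules are written.

```yaml
# alertmanager.yml (fragment)
route:
  receiver: platform-chat
  # One notification per alert name and node, not one per pod.
  group_by: ['alertname', 'cluster', 'node']

inhibit_rules:
  # While a drain is announced for a node, mute probe failures on that same node.
  - source_matchers:
      - 'alertname = "NodeDraining"'      # hypothetical alert fired while the drain marker is present
    target_matchers:
      - 'alertname = "PodProbeFailing"'   # hypothetical per-pod alert
    # Only inhibit when both alerts agree on these labels.
    equal: ['cluster', 'node']

receivers:
  - name: platform-chat
```

Keeping `target_matchers` narrow is deliberate. A broad target, such as everything with a `node` label, is the kind of rule that can also hide a real node failure.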

Rate limits and deduplication. Same alert firing every minute for an hour shouldn’t re-notify every minute unless severity escalates. PagerDuty and Alertmanager both support this; we still find rogue routes that bypass them.
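In Alertmanager the relevant knobs are the three route timers. A minimal sketch, with durations and the receiver name chosen for illustration rather than taken from our config:

```yaml
# alertmanager.yml (fragment)
route:
  receiver: oncall-pager            # hypothetical receiver name
  group_by: ['alertname', 'namespace', 'service']
  # Wait briefly so alerts that fire together go out as one notification.
  group_wait: 30s
  # Minimum gap before notifying about new alerts added to an existing group.
  group_interval: 5m
  # A still-firing, unchanged group is re-sent at most this often.
  repeat_interval: 4h

receivers:
  - name: oncall-pager
```

Child routes that set their own timers override these values, which is one way a rogue route ends up re-notifying every minute.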

SLO-based paging. Burn-rate alerts on error budget — page when the customer experience is threatened, not when one replica hiccuped. I don’t implement SLOs perfectly. Even a rough SLI beats paging on CPU > 80% because someone copied a Datadog monitor in 2019.
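A rough burn-rate page might look like the rule below. The assumptions are mine: a 99.9% availability SLO over 30 days, a `job="checkout"` label on `http_requests_total`, and the commonly used fast-burn factor of 14.4 (spending 2% of a 30-day budget in one hour: 0.02 × 720 h = 14.4).

```yaml
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        # Error budget for a 99.9% SLO is 0.001. Page when the error ratio
        # exceeds 14.4x the budget over both a long (1h) and a short (5m) window.
        expr: |
          (
            sum(rate(http_requests_total{job="checkout", status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="checkout"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="checkout", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="checkout"}[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
          team: checkout
        annotations:
          summary: "checkout is burning its error budget at more than 14.4x the sustainable rate"
```

The long window keeps one hiccuping replica from paging; the short window lets the alert clear soon after the errors stop.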

Deploy windows. Suppress known-benign blips during rollout, with a hard cap: if suppression lasts more than N minutes, escalate. It’s like announcing maintenance on ATIS so approach doesn’t treat every go-around as an emergency.
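One way to sketch a capped deploy window in PromQL is with `unless`. Everything deploy-related here is an assumption: `deploy_started_timestamp_seconds` is a hypothetical gauge your CD pipeline would have to export, and I'm assuming both it and `http_requests_total` carry `namespace` and `deployment` labels.

```yaml
groups:
  - name: deploy-aware
    rules:
      - alert: High5xxOutsideDeployWindow
        expr: |
          (
            sum by (namespace, deployment) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (namespace, deployment) (rate(http_requests_total[5m]))
          ) > 0.05
          # Suppress only while a deploy for the same workload started under 10 minutes ago.
          # After 600 seconds the right-hand side is empty and the alert can fire again.
          unless on (namespace, deployment)
          (
            (time() - deploy_started_timestamp_seconds) < 600
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.deployment }} 5xx ratio above 5% outside a deploy window"
```

The hard cap comes from the expression itself: nobody has to remember to lift the suppression.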

Listen-first also means on-call reads the room: existing incident channel, recent deploys, status page already red. Adding another siren when the war room is full helps nobody.

Say what matters, then stop

Radio brevity isn’t rudeness; it’s respect for shared attention. Alert text should carry:

Service / user impact — “checkout API 5xx rate 12% (baseline 0.1%)”
Scope — cluster, region, namespace, deployment
Suggested first check — link to runbook section or one kubectl command, not a wiki homepage
Ownership — team or escalation policy

Stop means don’t attach seventeen graphs to the page. Don’t CC six Slack channels “for visibility.” Don’t chain alerts that differ only in the word “critical” repeated.

I write runbook links into annotations where I can. When I’m tired on call, “see runbook” with no URL is the monitoring equivalent of “contact approach” with no frequency.

For Kubernetes-native alerts, the difference between useful and useless is often the labels and annotations:

# Alert annotation that actually helps at 2 a.m.
annotations:
  summary: "High 5xx on ingress nginx in prod-eu"
  runbook_url: "https://wiki.example/runbooks/ingress-5xx#tls-expiry"
  dashboard: "https://grafana.example/d/ingress-eu"

The query matters too. rate(http_requests_total{status=~"5.."}[5m]) without a job or ingress label is a Mayday broadcast with no position report.
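A scoped version of that query, as a sketch: the `job` and `ingress` values, the 5% threshold, and the team label are illustrative, and the metric is assumed to carry both labels.

```yaml
groups:
  - name: ingress-errors
    rules:
      - alert: IngressHigh5xxRatio
        # A ratio instead of a raw rate, scoped to one job and one ingress.
        expr: |
          sum by (job, ingress) (
            rate(http_requests_total{job="ingress-nginx", ingress="checkout", status=~"5.."}[5m])
          )
          /
          sum by (job, ingress) (
            rate(http_requests_total{job="ingress-nginx", ingress="checkout"}[5m])
          )
          > 0.05
        for: 5m
        labels:
          severity: page
          team: checkout
        annotations:
          # humanizePercentage renders a ratio such as 0.12 as a percentage.
          summary: "{{ $labels.ingress }} 5xx ratio is {{ $value | humanizePercentage }} over 5m"
          runbook_url: "https://wiki.example/runbooks/ingress-5xx"
```

Putting the measured value in the summary gives the person being paged the position report up front.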

Readbacks: closing the loop

Controllers require readbacks on critical items so everyone shares the same reality. Alerts without closure train people to ignore the next one.

We tried — imperfectly — to require incident notes for every page, even if the note is “false positive, threshold adjusted.” Five sentences max. What fired, what we checked, what we changed. Without that, the same false positive pages every Tuesday until someone leaves the team and knowledge evaporates.

For false positives specifically, the readback is: fix the alert or downgrade it within 48 hours. Not “ack and mute forever.” Aviation would revolt if the same bogus NOTAM appeared daily. We normalized that in monitoring for years.

When a page is real but handled quickly, the readback still matters: post-mortem optional for small stuff, but a Slack thread tagged #alert-tuning helps the next person. I hate process for process’s sake. This one saved us duplicate pages because three people independently “fixed” the same rule.

The mute button is not CRM

Pilots don’t turn off the radio because approach is chatty. They change frequency, request quieter vectors, or get vectored to less busy airspace. Muting is a failure mode of last resort.

When engineers mute Slack or disable PagerDuty mobile notifications, management sometimes reads it as disengagement. Often it’s rational adaptation to a hostile signal-to-noise ratio. Fixing that is leadership and tooling, not a lecture about ownership.

Things that actually reduced muting on teams I’ve been on:

Executive agreement that on-call is allowed to sleep if secondary will wake for true criticals
Quarterly alert audits with delete privileges — sacred cow rules get removed or fixed
Rotating “noise duty”: one engineer per sprint triages firing-but-non-paging alerts
Separating work queues from wake queues — email digest vs SMS

CRM (crew resource management) in the cockpit is about speaking up when something matters and shutting up when it doesn’t. Same contract for alerts: psychological safety to say “this page is worthless” without being labeled not a team player.

Platform patterns that respect the frequency

A few Kubernetes-specific patterns we’ve leaned on:

PodDisruptionBudget violations during voluntary disruptions: often warn in chat, don’t page, if the disruption is annotated and SLOs hold.

Pending pods: page when pods have been pending for more than N minutes, the workload is user-facing, and the cause is not a known quota issue. A bare Pending alert is too chatty on its own.

Node NotReady: page platform on-call with the node name; inhibit per-pod crash loops on that node to avoid a burst of child alerts.

Cert expiry: Pan-Pan at 14 days, Mayday at 72 hours if auto-renew failed. One alert per cert source, not one per duplicated Secret object.

HPA maxed out: trending ticket unless latency or errors cross the customer threshold. Max replicas is often a capacity-planning signal, not an emergency.
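Three of these patterns, sketched as Prometheus rules. This assumes kube-state-metrics for the pod and node metrics and cert-manager for `certmanager_certificate_expiration_timestamp_seconds`. The `prod-.*` namespace convention, the `for` durations, and the severity values are mine; the quota exclusion for Pending pods is left out.

```yaml
groups:
  - name: platform-patterns
    rules:
      # Pending pods: only user-facing namespaces, and only after 15 minutes.
      - alert: UserFacingPodPending
        expr: kube_pod_status_phase{phase="Pending", namespace=~"prod-.*"} == 1
        for: 15m
        labels:
          severity: page

      # Node NotReady: one alert per node, with the node name in the summary.
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m
        labels:
          severity: page
          team: platform
        annotations:
          summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"

      # Cert expiry, Pan-Pan tier: under 14 days left, ticket only.
      - alert: CertExpiresWithin14Days
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        labels:
          severity: ticket

      # Cert expiry, Mayday tier: under 72 hours left, so renewal has likely failed.
      - alert: CertExpiresWithin72Hours
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 72 * 3600
        labels:
          severity: page
```

Below 72 hours both cert alerts fire together, so the ticket stays open while the page goes out. Whether that is what you want is a judgment call for your own fleet.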

Your fleet will differ. The habit is asking “what action does this page demand?” before merging the rule.

What good sounds like

Quiet on-call doesn’t mean nothing ever breaks. It means when the phone rings, the team trusts it’s worth waking up.

Good weeks sound like:

One or two pages, both actionable
False positives get tickets, not shrugs
Deploy-related blips stay in the deploy channel
Incidents generate alert follow-ups in the same week

Bad weeks sound like:

“Did anyone else get paged?” threads at 4 a.m.
On-call completes shift with twenty acks and zero root-cause fixes
New hires told “just ignore that one”

I’ve lived in both. The exit path was never a single tool migration. It was slow radio discipline: fewer voices, clearer callsigns, appropriate urgency, readbacks.

What I’m still learning

I still create alerts without the five-minute action test. I still copy YAML from old repos and forget to update the runbook URL. I still debate whether a burn-rate page is too slow for our traffic shape.

Alert fatigue isn’t a moral failure of on-call engineers. It’s feedback that the system is transmitting on top of itself. Aviation solved that with procedure and culture before better radios existed. We can borrow the culture part without pretending we’re in a cockpit.

Listen before you transmit. Say what matters. Stop when you’re done. Fix the guard frequency so the next Mayday gets answered fast.

That’s the whole discipline. Everything else is tuning and humility.