A green Pod icon is not the same as ready for passengers. Readiness probes are how you tell the difference. I have learned this lesson more than once by watching traffic hit instances that were running but not yet able to serve — and by watching readiness probes lie because we pointed them at the wrong endpoint.

This post is about what readiness should mean in practice, how it differs from liveness, mistakes I still see (and have made), and how to tune probes so they reflect reality instead of tutorial defaults. I am not a Kubernetes internals expert. I am someone who has been paged because the / path returned 200 while the application behind it was not ready.

What readiness should mean

Answer one question: should this Pod receive traffic right now?

That is different from “is the process alive” or “did the container start.” A Java service can be running while garbage collection pauses make it a bad target. A web server can listen on port 8080 while migrations still run. A worker might be alive but deliberately not pulling from its queue because it is waiting for a cache warm-up.

Readiness gates Service endpoints and load balancer backends. When readiness fails, the Pod is removed from Endpoints (for a standard Service). Traffic should stop flowing to that instance until readiness passes again. That is graceful degradation at the Pod level — if your probes are honest.

Things readiness commonly checks:

  • Application finished booting and can serve requests within SLO
  • Database migrations completed (if they run at startup)
  • Critical dependencies reachable — cache, DB, identity provider — within bounds you accept
  • Feature flags and config loaded
  • For workers: ready to consume work, not just alive

Things readiness should not pretend to guarantee:

  • Full dependency health for the entire stack
  • Business logic correctness
  • Zero latency forever

Readiness is a traffic gate, not a moral judgment of your architecture.

Liveness is a different question

Liveness asks: should Kubernetes restart this container?

Mixing readiness and liveness causes two painful patterns:

  1. Restart loops — liveness probe fails during slow startup; kubelet kills the container repeatedly; you never reach ready.
  2. Traffic to half-dead instances — readiness passes on a shallow check while the app cannot do useful work; liveness never fires because the process responds to HTTP.

Aviation analogy, kept subtle: before takeoff you verify items that matter for this flight. Readiness is that verification for your service. Liveness is more like “is the engine still running” — a coarser signal that something fundamental failed and you need a restart, not just to be removed from rotation.

Rule of thumb I use:

  • Readiness fails → stop sending traffic, maybe recover without restart
  • Liveness fails → kubelet restarts the container

If a slow dependency should remove you from traffic but not kill the process, that belongs on readiness, not liveness.

How Kubernetes uses probe results

The kubelet runs probes on a schedule defined by your Pod spec. A readiness failure removes the Pod IP from the ready addresses of every Service that selects it. Existing connections may persist depending on configuration; new connections should not arrive. When readiness succeeds again, the Pod returns to rotation.

During rolling updates, readiness controls when each new Pod counts as available, and therefore how fast the rollout proceeds. A too-strict readiness probe can stall a deploy. A too-loose probe marks the rollout successful while instances are still warming up, and error rates spike when traffic shifts.

For OpenShift, Routes and Services follow the same Endpoint semantics for ready addresses. Operators that manage Deployments still respect probe configuration in the Pod template. Platform teams sometimes inject defaults; application teams should verify them instead of assuming.

Understanding this flow helps when debugging: kubectl get pods showing Running does not mean Ready. Check the Ready column. Check Endpoints:

kubectl get pods -l app=checkout -o wide
kubectl get endpoints checkout -o yaml
kubectl describe pod <pod-name> | grep -A10 Readiness

If Ready is false, traffic should not hit that Pod. If Ready is true and users still see errors, your probe is not measuring what you think.

HTTP, TCP, exec — picking a probe type

HTTP GET is the default choice for web services. Point at a dedicated readiness path, not necessarily /.

TCP socket checks that a port accepts connections. Fast and simple. Dangerous when the process listens before application logic is ready — common with many frameworks.

Exec runs a command inside the container. Flexible, heavier, harder to reason about in production. Useful when readiness requires a local script that checks multiple internal conditions.

Example of a starting HTTP readiness probe I might use before tuning:

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
    scheme: HTTP
  periodSeconds: 10
  timeoutSeconds: 2
  successThreshold: 1
  failureThreshold: 3
  initialDelaySeconds: 5

And a liveness probe kept deliberately simpler and more tolerant:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3
  initialDelaySeconds: 30

Notice /ready versus /healthz. Different paths, different semantics, implemented in application code. The application team must own those handlers. Platform YAML alone cannot invent them.

Implementing /ready honestly

A readiness handler should execute the checks you are willing to fail traffic over — not every check you wish you had.

Good patterns:

GET /ready
200 if: migrations done, DB ping < 500ms, cache client connected
503 if: any required dependency unavailable

Bad patterns:

GET /ready
200 always if main() started

Also bad:

GET /ready
runs full checkout integration test including third-party sandbox

The first bad pattern marks Pods ready when only the process exists: kube sends traffic, users get 500s. The second marks Pods unready whenever the sandbox blips, taking capacity offline during someone else’s outage.

I prefer explicit JSON bodies for debugging:

{"ready": false, "reason": "database_migrations_pending"}

Logs and describe output become easier to correlate with metrics.
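
A minimal Python sketch of the good pattern, using only the standard library. The names here (evaluate_readiness, make_handler, and the idea of keying each check by the reason it reports on failure) are my own illustration, not an API from any framework. Each check is assumed to be a cheap zero-argument callable that returns True when the dependency is usable.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict, Tuple

# Hypothetical registry: failure reason -> callable returning True when healthy.
Check = Callable[[], bool]


def evaluate_readiness(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Run required checks in order; return (HTTP status, JSON-serializable body)."""
    for reason, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failed dependency
        if not ok:
            return 503, {"ready": False, "reason": reason}
    return 200, {"ready": True}


def make_handler(checks: Dict[str, Check]):
    """Build a request handler class that serves GET /ready from the given checks."""

    class ReadyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/ready":
                self.send_error(404)
                return
            status, body = evaluate_readiness(checks)
            payload = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep probe traffic out of stderr in this sketch

    return ReadyHandler
```

To try it locally, something like HTTPServer(("", 8080), make_handler(checks)).serve_forever() would serve the endpoint. In a real service the handler would live in whatever web framework the application already uses; the point is only that the list of checks is explicit and short.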

For gRPC services, use the built-in grpc probe (stable since Kubernetes 1.27) or a small sidecar HTTP translator if you must. Do not pretend TCP on the gRPC port equals application readiness.
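
A sketch of what a native gRPC readiness probe might look like. The port and service name are assumptions for illustration, and the probe only works if the application implements the standard gRPC health checking service.

```yaml
readinessProbe:
  grpc:
    port: 9090          # assumed gRPC port number
    service: readiness  # optional; hypothetical name sent in the health check request
  periodSeconds: 10
  failureThreshold: 3
```

The service field lets one server answer readiness and liveness differently, which keeps the /ready versus /healthz split from the HTTP examples.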

Common mistakes I still see

HTTP probe on / when health is really on /ready. The root page serves static content or redirects while backend initialization continues. Marketing site stays green; API is not.

initialDelaySeconds too low. Probe fails during normal startup; Pod flaps Ready; rolling update never completes or old Pods carry all traffic until new ones accidentally pass.

Probe succeeds while caches warm. Error rate climbs after deploy because readiness passed before warm-up finished. Fix: readiness waits for warm-up or use progressive traffic shift (canary) with metrics gates outside kube.

No probe at all. kube thinks everything is fine because the container is running. This is the default in too many internal templates.

Readiness checks the entire dependency chain. One slow dependency removes all Pods from service — total outage instead of graceful partial degradation. Sometimes intentional; often accidental.

Same timeouts for liveness and readiness. Liveness should be slower and less sensitive to avoid restart storms.

Ignoring startup probes. Kubernetes startup probes (when used) protect slow-start containers during boot so liveness does not kill them prematurely. Readiness still controls traffic after startup completes.
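
For reference, a startup probe might look like the sketch below. The values are illustrative, not recommendations; the budget should come from measured boot times.

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 24   # allows up to 5 s x 24 = 120 s for boot
```

Liveness and readiness probes do not run until the startup probe has succeeded, so liveness can stay tight for steady state without killing slow boots.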

I have shipped more than one of these mistakes. The post-incident pattern is always similar: we assumed Running meant ready; we assumed / meant ready; we assumed defaults from a four-year-old blog post matched our JVM heap behavior.

Tuning from measurement, not tutorials

Copying probe values from a tutorial is how you get periodSeconds: 10 everywhere. Tune from startup curves and dependency behavior.

Process:

  1. Deploy to staging with readiness failing until your app sets an internal “ready” flag truthfully.
  2. Measure time from container start to genuinely ready for production traffic — p95, not best case.
  3. Set initialDelaySeconds above that p95 with margin, or use a startup probe for long boots.
  4. Set timeoutSeconds above worst-case handler time under load.
  5. Set failureThreshold so transient blips do not drop Pods instantly — but not so high that bad instances stay in rotation for minutes.
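
The arithmetic behind these steps can be sketched in Python. The sample startup times, the 5-second probe period, and the 1.5x margin are made-up assumptions for illustration; substitute your own measurements.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def startup_failure_threshold(p95_seconds: float, period_seconds: int, margin: float = 1.5) -> int:
    """failureThreshold for a startup probe whose budget (period x threshold) covers p95 x margin."""
    return math.ceil(p95_seconds * margin / period_seconds)


def unready_after_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Rough time for a failing Pod to leave rotation: consecutive failures times probe period."""
    return period_seconds * failure_threshold


# Hypothetical container-start-to-ready times in seconds, measured in staging.
startup_samples = [18.2, 21.5, 19.9, 44.0, 23.1, 20.4, 22.8, 25.6, 19.1, 31.7]

p95 = percentile(startup_samples, 95)            # 44.0 with these samples
threshold = startup_failure_threshold(p95, 5)    # 14 probes, a 70 s boot budget
drop_time = unready_after_seconds(10, 3)         # 30 s with periodSeconds 10, failureThreshold 3
```

Note how one slow boot dominates the p95 with only ten samples. That is the reason to measure many starts, and to measure them under realistic conditions such as a cold node or a busy database.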

Watch during a rollout:

kubectl rollout status deployment/checkout -n production
kubectl get pods -l app=checkout -w

Correlate with application metrics: error rate, latency, connection pool usage. If Ready flips true at the same moment errors spike, your probe lied.

For OpenShift deployments, oc rollout status behaves similarly. Check Route backends if external monitoring still shows errors despite Ready — caching layers and CDN origins sit outside kube probes.

Readiness during shutdown

Readiness should fail when the Pod is terminating gracefully so load balancers stop sending new requests before SIGTERM handling completes. PreStop hooks and readiness together implement graceful drain:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]
readinessProbe:
  httpGet:
    path: /ready
    port: 8080

Application code should also start failing readiness once it receives SIGTERM; the exact pattern depends on the framework. The goal: stop new work, finish in-flight work within terminationGracePeriodSeconds, exit.
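
One way the application side might look, as a minimal Python sketch. ReadinessState and install_sigterm_handler are hypothetical names; the sketch only flips a flag that a /ready handler would consult, and leaves draining in-flight work and exiting to the framework.

```python
import signal
import threading


class ReadinessState:
    """Thread-safe flag that a /ready handler can consult."""

    def __init__(self) -> None:
        self._draining = threading.Event()

    def is_ready(self) -> bool:
        return not self._draining.is_set()

    def begin_drain(self, signum=None, frame=None) -> None:
        # Signature matches what signal.signal() expects from a handler.
        self._draining.set()


def install_sigterm_handler(readiness: ReadinessState) -> None:
    """Flip readiness to false on SIGTERM; must be called from the main thread."""
    signal.signal(signal.SIGTERM, readiness.begin_drain)
```

The /ready handler would then return 503 whenever is_ready() is false, in addition to its dependency checks.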

Skipping this produces 502s during node drains and deploys even when probes looked perfect in steady state.

When not to use readiness

Not every Pod should receive a readiness probe. Batch jobs, one-shot CronJobs, and pure workers without Services may need liveness or nothing — not HTTP readiness on a meaningless port.

Stateful systems with long leader election may need custom readiness that returns ready only for the elected leader and not for followers. Sending traffic to all replicas when only one should serve writes is a design problem probes must reflect.

DaemonSets that expose node-local services still need honest readiness if anything upstream balances across nodes.

Automation and GitOps notes

Probes belong in version control beside the Deployment manifest. Review them in pull requests like application code. A tightened timeout merged Friday afternoon becomes a weekend outage.

For Helm charts, expose probe paths and thresholds as values — but document sane defaults and why they exist. For Kustomize overlays, environment-specific tuning (staging shorter, production stricter) should be visible in diffs.

Policy engines (OPA Gatekeeper, Kyverno) can require readiness probes on Deployments in clusters where platform standards matter. I prefer education first, policy when teams repeatedly skip probes — but I understand platform teams who choose policy after the third preventable incident.

Aviation angle, kept honest

Preflight checks are not performed because the airplane is usually broken. They are performed because some failures are easier to catch on the ground than after commitment. Readiness probes are ground checks for instances. They do not guarantee flight safety — your architecture, limits, and operators still matter — but they prevent obviously unready Pods from receiving traffic.

Do not extend the metaphor too far. Pilots do not restart engines every ten seconds based on an automated nag. Kubernetes will restart on liveness failure. Design probes so they nag appropriately.

What I am still learning

Per-application quirks dominate. JVM services, Node event loops, Python WSGI workers, .NET generic hosts — each has startup shapes I cannot fully predict from YAML alone. I talk to application developers instead of guessing.

Service meshes add another readiness story: sidecar ready versus application ready. If Envoy is up but the app is not, traffic may still flow through the sidecar to a dead backend depending on configuration. Mesh documentation deserves its own page; the lesson here is probe what the user experiences, not only what the platform sees.

I am also learning when to move gates out of kube entirely — canaries based on Prometheus error rates, autoscaling on queue depth — and keep readiness as a coarse gate. Probes are cheap insurance when honest. They are not the whole reliability program.

Practical checklist before you merge

  • Dedicated readiness endpoint implemented in app code
  • Liveness endpoint simpler and harder to false-positive
  • Paths and ports match container configuration
  • Timeouts tested under load, not idle
  • Rolling update observed in staging with metrics
  • Graceful shutdown fails readiness before exit
  • Runbook mentions what a failing readiness looks like externally

Closing thought

Readiness probes are boring configuration until they are not. When they work, deploys feel boring in a good way — traffic shifts, error rates stay flat, nobody pages you for warm-up 500s. When they fail or lie, you debug mysterious partial outages that look like load balancer ghosts.

Green Pod icons are seductive. I try to treat Ready=true as a claim that deserves evidence, the same way a preflight item is only complete when you actually checked it, not when you assumed someone else did. I still get this wrong sometimes. Measuring startup honestly and fixing probes after incidents is how I improve without pretending I already know every application’s boot story.