The scheduler does not care how much CPU your Pod actually uses. It cares what you wrote in the manifest. I forget that regularly, even after years of running clusters. Then a Pending Pod sits for an hour, or a node runs hot, or something gets OOMKilled at the worst moment, and I remember: requests are promises, limits are guardrails, and honesty is the part nobody puts in the Helm chart comments.


I am not going to pretend I size workloads perfectly. I have copied 256Mi from a blog post, set limits equal to requests “for safety,” and left both unset because “it worked in dev.” All of those choices came back to teach me something. This is what I wish I had understood earlier, written the way I explain it to a teammate who is tired and just wants the Deployment green.

What the scheduler actually believes

When you set resources.requests.cpu and resources.requests.memory, you are telling the scheduler: reserve this much on a node before you place my Pod. The scheduler sums requests across Pods on a node and compares them to allocatable capacity. It does not peek at last week’s Grafana graph unless you bolt on a usage-aware custom scheduler. Even VPA works by rewriting the requests, not by changing what the scheduler reads.
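To make that bookkeeping concrete, here is a minimal sketch of the fit check, assuming a simplified node model. The Node class, the fits function, and the numbers are hypothetical illustrations, not the scheduler's real code, and the real scheduler applies many more filters than this.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int   # millicores the node can hand out to Pods
    allocatable_mem_mi: int  # MiB the node can hand out to Pods
    # (cpu_m, mem_mi) requests of Pods already placed on the node
    pod_requests: list = field(default_factory=list)


def fits(node, cpu_m, mem_mi):
    """Return True if the new Pod's requests fit beside the existing requests."""
    used_cpu = sum(cpu for cpu, _ in node.pod_requests)
    used_mem = sum(mem for _, mem in node.pod_requests)
    return (used_cpu + cpu_m <= node.allocatable_cpu_m
            and used_mem + mem_mi <= node.allocatable_mem_mi)


node = Node("node-a", 4000, 7168, [(1500, 3072), (1000, 2048)])
print(fits(node, 1000, 2048))  # True: requests sum to 3500m and 7168Mi
print(fits(node, 1000, 2049))  # False: one MiB over allocatable memory
```

Actual usage never appears in the calculation. A node whose Pods are idle but heavily requested is full as far as this check is concerned.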

Limits are different. CPU limits throttle your container through cgroup quotas — you can hit the ceiling and slow down without dying. Memory limits are harder: exceed them and the Linux OOM killer may terminate your process. Kubernetes surfaces that as OOMKilled. There is no gentle negotiation.
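The CPU side can be sketched with a little arithmetic. Kubernetes turns a CPU limit into a CFS quota per scheduling period, 100ms by default, so a 500m limit becomes 50ms of CPU time per 100ms. The model below is a simplification under stated assumptions: one thread, work starting at the beginning of a period, and hypothetical function names.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100ms


def cfs_quota_us(limit_millicores, period_us=CFS_PERIOD_US):
    """CPU time the container may consume per period, in microseconds."""
    return limit_millicores * period_us // 1000


def wall_time_us(cpu_needed_us, limit_millicores, period_us=CFS_PERIOD_US):
    """Wall-clock time for a single thread that needs cpu_needed_us of CPU."""
    quota = cfs_quota_us(limit_millicores, period_us)
    # Every period whose quota is exhausted costs a full period of wall time.
    full_periods = (cpu_needed_us - 1) // quota
    remaining = cpu_needed_us - full_periods * quota
    return full_periods * period_us + remaining


print(cfs_quota_us(500))          # 50000
print(wall_time_us(80_000, 500))  # 130000: 50ms run, 50ms throttled, 30ms run
print(wall_time_us(80_000, 1000)) # 80000: never hits the quota
```

The container with the 500m limit finishes the same work 50ms later and nothing crashes. That is the "slow down without dying" behavior, and it is why throttling tends to show up as latency, not as errors.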

That asymmetry matters. High requests with loose or missing limits make scheduling painful while still allowing bursts until memory blows up. Low requests with high limits make the cluster look roomy until every Pod on the node competes for real RAM and the kernel starts choosing victims. No requests and no limits means BestEffort QoS: fine for a local kind cluster, risky for production neighbors who assumed the scheduler was protecting them.

Quality of Service classes (Guaranteed, Burstable, BestEffort) are not academic. When a node is under memory pressure, BestEffort Pods and Burstable Pods running well above their requests are the most likely to be evicted or OOMKilled first. I do not treat QoS as a badge. I treat it as who gets sacrificed when the math stops working.
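The class is derived from the manifest, not chosen. A rough sketch of the rule follows, assuming each container is a plain dict with optional requests and limits, and that quantities are written identically when they are equal (this sketch compares strings, so "1" and "1000m" would not match). The qos_class helper is hypothetical.

```python
def qos_class(containers):
    """Approximate the Pod QoS class from per-container cpu/memory settings."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit when one is set.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))                                       # BestEffort
print(qos_class([{"requests": {"memory": "512Mi"}}]))        # Burstable
print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits": {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
```

One sidecar without resources is enough to pull an otherwise Guaranteed Pod down to Burstable, because the rule has to hold for every container.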

The gap between usage and requests

The most common failure mode I see is not malicious. It is optimistic.

A Java service idles at 400Mi but spikes to 1.2Gi during batch import. Someone sets requests.memory: 512Mi because “that’s what it usually needs.” Scheduling looks efficient. At 2 a.m. the import runs, the Pod climbs, the node fills, and either the Pod dies or its neighbor does. The dashboard showed green until it did not.

The opposite mistake wastes money: requests pinned at peak, limits double that “just in case,” and the cluster autoscaler never scales down because requested capacity looks enormous while the nodes sit half empty. Finance asks why we need twelve nodes. On-call asks why HPA never scales out, because utilization measured against an inflated request never reaches the target. Both are right from their angle.
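The HPA side of that complaint is easy to see with numbers. For a CPU utilization target, HPA measures usage as a percentage of the request and scales by the ratio of current to target. The sketch below uses the documented formula but ignores the tolerance band and stabilization windows, and the workload numbers are made up for illustration.

```python
import math


def utilization_pct(usage_m, request_m):
    """CPU utilization as HPA sees it: usage relative to the request."""
    return 100 * usage_m / request_m


def desired_replicas(current_replicas, current_pct, target_pct):
    """desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_pct / target_pct)


usage_m = 900  # each Pod is burning 900 millicores under load
for request_m in (500, 1500):
    pct = utilization_pct(usage_m, request_m)
    print(request_m, pct, desired_replicas(4, pct, 70))
# 500 180.0 11
# 1500 60.0 4
```

Same load, same four Pods. With a 500m request HPA wants eleven replicas. With a 1500m request it sees 60% against a 70% target and stays put.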

Honest sizing starts with measuring over time, not at a single moment:

Look at sustained usage, not spikes alone. Prometheus, metrics-server, cloud monitoring — pick one and graph CPU and memory per Pod over days and weeks. I want p50 and p95, not the peak that happened once during a bad deploy.

Separate idle from busy paths. Cron jobs, queue consumers, and admin endpoints have different profiles than steady API traffic. One Deployment with one request value for all containers is sometimes fine. Often it is not.

Include the sidecars. Istio, log shippers, and Vault agents eat memory too. I have debugged “mystery” OOMs that were just a Pod sized for the main container alone, with a 200Mi sidecar nobody added to the spreadsheet.

Revisit after releases. Memory leaks do not always crash locally. A slow creep over three releases can turn a comfortable 512Mi request into a lie.

I still round up a little. Aviation taught me that rounding every fuel number down is how you arrive surprised. A modest buffer between p95 usage and request is not waste; it is scheduling honesty with margin.
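As a sketch of what "p95 plus a modest buffer" means in numbers, assume memory samples in MiB exported from whatever monitoring you use. The nearest-rank percentile, the 15% headroom, and the 64Mi rounding step are my assumptions for illustration, not recommendations from any tool.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


def suggested_request_mi(samples_mi, headroom_pct=15, round_to=64):
    """p95 plus headroom, rounded up to a multiple of round_to MiB."""
    p95 = percentile(samples_mi, 95)
    padded = -(-p95 * (100 + headroom_pct) // 100)
    return -(-padded // round_to) * round_to


# Hypothetical samples: mostly idle, a couple of busy moments, one bad deploy.
samples = [400] * 17 + [450, 480, 1200]
print(percentile(samples, 50))        # 400
print(percentile(samples, 95))        # 480
print(max(samples))                   # 1200
print(suggested_request_mi(samples))  # 576
```

The one-off 1200Mi spike does not drive the request, but it should not be thrown away either. It is the number to keep in mind when the memory limit gets set.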

Limits without superstition

Limits are not a moral good. They are a tool.

For CPU, I use limits when I need to protect neighbors from noisy workloads: multi-tenant nodes, batch mixed with latency-sensitive APIs. I accept that throttling can raise latency in ways metrics-server will not scream about. For latency-critical services on dedicated nodes, I sometimes skip CPU limits and rely on requests plus node placement. That is a trade, not a rule.

For memory, I set limits when I want the kernel to kill this container instead of random processes on the node. The limit should be above observed peak with headroom, or you are choosing crash loops. Setting limit equal to request “so Kubernetes knows what we need” misunderstands the mechanism. Setting request and limit equal for both CPU and memory on every container gives you Guaranteed QoS, which helps eviction order, but does not magically create RAM on the node.

I have watched teams set limits.memory to exactly the request, then wonder why JVM or Node heaps cannot breathe. Runtime overhead, page cache behavior, and brief allocation spikes need space. Read the OOMKilled events. They tell you the truth bluntly:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137

137 is not subtle. When I see it, I check actual usage against limit before I touch application code. Half the time the limit was wrong. Half the time the application was wrong. Guessing which half without data wastes a Friday.
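The number itself decodes cleanly: exit codes above 128 conventionally mean "terminated by signal (code minus 128)", and signal 9 is SIGKILL, which is what the OOM killer sends. A small sketch, with a hypothetical helper and only two signal names filled in:

```python
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}  # just the two most common here


def describe_exit_code(code):
    """Explain a container exit code using the 128 + signal convention."""
    if code > 128:
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, "unknown")
        return f"killed by signal {signum} ({name})"
    return f"exited on its own with status {code}"


print(describe_exit_code(137))  # killed by signal 9 (SIGKILL)
print(describe_exit_code(143))  # killed by signal 15 (SIGTERM)
print(describe_exit_code(1))    # exited on its own with status 1
```

Exit code 137 alone only says SIGKILL. It is the OOMKilled reason next to it that points at memory.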

Scheduling pain is often a requests problem

Pending Pods with Insufficient cpu or Insufficient memory are the scheduler refusing your promise because the node cannot fit it. That is not a bug. That is the cluster telling you your manifests do not match physics.

Common causes I still stumble into:

Requests too high for the node shape. A Pod requesting 8Gi on a family of nodes with 7Gi allocatable will pend forever. Obvious on paper. Easy to miss when node pools end up with mixed sizes after a migration.

Policy dressed up as capacity. ResourceQuotas and LimitRanges reject Pods at admission, and taints keep Pods off nodes that otherwise have room. The messages can look like a capacity problem when the real issue is policy.

Fragmentation. Enough total memory free across the cluster, but no single node with 4Gi unreserved because many medium Pods ate the puzzle. Consolidation, right-sizing, or a second node pool sometimes fixes what “add nodes” alone does not.

Forgotten DaemonSets. They consume allocatable too. I have planned capacity for application Pods and ignored the monitoring stack on every node.

When someone asks me to “just schedule it,” I translate: either lower the request, add a node, or stop pretending the Pod needs that much. There is no fourth option that lasts.
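Fragmentation is the least intuitive of those causes, so here is a tiny sketch of it. The node names and free-memory numbers are invented, and "free" here means allocatable minus summed requests, DaemonSets included.

```python
def schedulable_nodes(free_mi_by_node, request_mi):
    """Names of nodes whose unreserved memory can hold the request."""
    return [name for name, free in free_mi_by_node.items()
            if free >= request_mi]


# Unreserved memory per node, in MiB.
free = {"node-a": 3072, "node-b": 3072, "node-c": 2048}
request_mi = 4096

print(sum(free.values()))                   # 8192: twice what the Pod asks for
print(schedulable_nodes(free, request_mi))  # []: yet no single node can host it
print(schedulable_nodes(free, 3072))        # ['node-a', 'node-b']
```

A Pod cannot be split across nodes, so the cluster-wide total is the wrong number to look at. What matters is the largest unreserved slot on any one node.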

OOM, eviction, and the human on the other end

OOMKilled and evicted Pods feel like application failures. Often they are capacity failures wearing an application mask.

Node memory pressure triggers the kubelet to evict Pods based on QoS and usage over request. You see it in events:

The node was low on resource: memory. Container X was using Y, request is Z, limit is W.

Reading that line carefully saves time. Using tells you what happened. Request tells you what the scheduler thought. Limit tells you where the cgroup ceiling was. When those three numbers tell different stories, your sizing doc is wrong, not Kubernetes.

For on-call, I try to triage in order:

  1. Is this isolated to one Pod or many on the same node?
  2. Did anything change — deploy, config, traffic, cron?
  3. What do metrics show for memory trend — step change or slow climb?
  4. Do limits and requests match reality from the last rightsizing pass?

If many Pods on one node fail together, I look at the node before I roll back application code. If one Pod fails alone, I look at limits and the release. That split is boring and it works.

How I rightsize without pretending to automate judgment

Tools help. Judgment still owns the outcome.

Vertical Pod Autoscaler in recommendation mode has saved me from spreadsheet denial. I do not auto-apply VPA changes in production without reading them. Recommendations are statistical. Black Friday is not average Tuesday.

Goldilocks and similar dashboards that compare request, limit, and usage side by side are good for quarterly hygiene. I schedule a recurring calendar invite because I will not do it spontaneously.

LimitRanges on namespaces catch “forgot to set resources” before production. Default requests and limits are blunt instruments — they stop the worst cases, they do not replace per-workload thought.

Load tests still matter. Metrics from normal traffic miss the first time a new export feature loads half a gigabyte into memory. I am not saying Gatling solves sizing. I am saying you should see the spike before customers do.

When I write resources for a new service, my rough checklist is:

  • Set requests near sustained p95 usage with modest headroom.
  • Set memory limit above observed peak; avoid limit = request unless I mean Guaranteed and understand heap behavior.
  • Decide CPU limit based on neighbor risk, not habit.
  • Document why the numbers exist in the PR, not only in my head.
  • Re-check after 30 days in production.

That last step is the one I most often skip. I am trying to get better.
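The first three checklist items can be sketched as a function that turns measured numbers into a resources stanza. Everything here is an assumption for illustration: the build_resources name, the 15% headroom, the rounding steps, and the inputs. The output is a starting point to argue about in the PR, not a verdict.

```python
def with_headroom(value, pct):
    """Add pct percent, rounding up to a whole unit."""
    return -(-value * (100 + pct) // 100)


def round_up(value, step):
    """Round up to the next multiple of step."""
    return -(-value // step) * step


def build_resources(p95_cpu_m, p95_mem_mi, peak_mem_mi,
                    headroom_pct=15, cpu_limit_m=None):
    """Requests near p95, memory limit above peak, CPU limit only on purpose."""
    requests = {
        "cpu": f"{round_up(with_headroom(p95_cpu_m, headroom_pct), 50)}m",
        "memory": f"{round_up(with_headroom(p95_mem_mi, headroom_pct), 64)}Mi",
    }
    limits = {
        "memory": f"{round_up(with_headroom(peak_mem_mi, headroom_pct), 64)}Mi",
    }
    if cpu_limit_m is not None:  # set only when neighbor risk justifies it
        limits["cpu"] = f"{cpu_limit_m}m"
    return {"requests": requests, "limits": limits}


print(build_resources(p95_cpu_m=220, p95_mem_mi=480, peak_mem_mi=1200))
# {'requests': {'cpu': '300m', 'memory': '576Mi'}, 'limits': {'memory': '1408Mi'}}
```

The result is Burstable by design: the memory limit sits above the observed peak and there is no CPU limit unless someone asks for one. The last two checklist items, the PR note and the 30-day recheck, do not fit in a function.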

Honest capacity is a team sport

Platform teams own nodes and defaults. Application teams own usage patterns. When those groups do not talk, requests become folklore.

I have seen “standard sizing” tables — small: 256Mi/512Mi, medium: 512Mi/1Gi — that made onboarding easy and production noisy. Templates are fine starting points if they come with permission to deviate when metrics disagree.

Finance sees requested capacity on the invoice. Engineering sees used capacity on the dashboard. Until someone reconciles those views, you will fight about node counts in meetings that could have been a shared Grafana board.

The humble version of this story is: I have shipped manifests that lied. They lied because I was busy, because the tutorial said 500m CPU, because copying last year’s YAML felt safe. The cluster believed me until reality did not. Requests and limits are where application architecture meets infrastructure physics. Treating them as decoration is like filing a flight plan with fuel numbers that match what you hope instead of what you burn.

I still get it wrong sometimes. The difference now is I expect to get it wrong, measure sooner, and fix the numbers before OOMKilled fixes them for me. If you are staring at a Pending Pod or an exit code 137 today, start with one question: what did we promise the scheduler, and what did we actually need? The gap between those two is usually the whole incident.