OpenShift day-two operations for application teams

The first deploy to OpenShift feels like arrival. Routes resolve, Pods go Ready, someone says “we are on the platform now.” Day two is when the cluster keeps running without applause — upgrades land, quotas bite, logs stop appearing, and the line between application problem and platform problem gets blurry.

I have been on application teams that treated OpenShift like rented Kubernetes with a nicer console. That works until it does not. This post is for engineers who own services on shared OpenShift — not cluster admins, but people who still need to understand what happens after go-live.

Day two is where comfort ends

Day one problems are visible: image pull errors, missing Secrets, Route host typos. Day two problems are systemic:

Cluster upgrades change API versions, operator versions, and sometimes default security profiles.

Capacity is shared; your neighbor’s batch job affects scheduling.

Observability depends on agents and collectors the platform installed — or forgot to enable in your namespace.

Backups may cover etcd and cluster metadata while your database backup remains your job — or the opposite, depending on contract.

Policy — SCCs, network policy baselines, compliance operators — evolves without a deploy from your repo.

None of this requires becoming a cluster admin. It requires knowing what you own, what the platform owns, and when to cross the bridge with a good ticket.

Upgrade awareness without owning the control plane

On managed OpenShift (ROSA, ARO) or enterprise clusters operated by a platform team, application teams rarely trigger control plane upgrades. They still feel them.

Before an upgrade window, platform teams often publish target versions and maintenance schedules. Application teams should:

Read release notes for the OpenShift version jump. Removed API versions give no warning while manifests sit in Git; they fail only when someone applies them.

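Run oc get apirequestcounts, or use the API inspection tools your platform documents, to find API versions that are scheduled for removal and still receive requests. APIRequestCount is a cluster-scoped resource, so you may need the platform team to grant read access or run the query for you.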

Test in non-production on the target version or channel if a pre-upgrade cluster exists.

Check operator subscriptions for third-party operators your namespace uses — they may need channel bumps aligned with OCP.

Example check of request counts and served resources in the apps API group:

oc get apirequestcounts | grep -i apps
oc api-resources --api-group=apps --verbs=list
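
If you can export the counters as JSON (oc get apirequestcounts -o json), a few lines of Python can narrow the list to APIs that are flagged for removal and still receive traffic. This is a minimal sketch, not a platform tool: the function name and the sample data are hypothetical, and it assumes the status.removedInRelease and status.requestCount fields that APIRequestCount objects expose on recent OpenShift 4 releases.

```python
import json


def removed_apis_in_use(api_request_counts: dict) -> list:
    """Return (name, removed_in_release, request_count) tuples for APIs that
    are flagged for removal and have recorded requests, busiest first."""
    findings = []
    for item in api_request_counts.get("items", []):
        status = item.get("status", {})
        removed_in = status.get("removedInRelease", "")
        count = status.get("requestCount", 0)
        if removed_in and count > 0:
            findings.append((item["metadata"]["name"], removed_in, count))
    return sorted(findings, key=lambda finding: finding[2], reverse=True)


# Hypothetical sample shaped like `oc get apirequestcounts -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "cronjobs.v1beta1.batch"},
   "status": {"removedInRelease": "1.25", "requestCount": 42}},
  {"metadata": {"name": "deployments.v1.apps"},
   "status": {"requestCount": 9000}}
]}
""")
print(removed_apis_in_use(sample))
```

The counters are cluster-wide, so attributing a hit to your own namespace may still take a conversation with the platform team.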

During upgrade, expect brief API unavailability, node reboots, and Pod rescheduling. Applications with single replicas and strict PodDisruptionBudgets need review before the window — not during it.
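
To make that review concrete, it helps to compute how many voluntary evictions a PodDisruptionBudget actually leaves room for. The sketch below is a simplified model under stated assumptions: every replica is healthy, the budget uses absolute integers rather than percentages, and the function names are hypothetical.

```python
from typing import Optional


def allowed_disruptions(replicas: int,
                        min_available: Optional[int] = None,
                        max_unavailable: Optional[int] = None) -> int:
    """Return how many Pods may be evicted voluntarily at the same time.

    Simplified: assumes all replicas are healthy and integer budget values.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        return max(replicas - min_available, 0)
    return max(min(max_unavailable, replicas), 0)


def blocks_node_drain(replicas: int, **budget: int) -> bool:
    """True when the budget leaves no room for a drain to evict any Pod."""
    return allowed_disruptions(replicas, **budget) == 0


# A single replica with minAvailable: 1 cannot be evicted at all.
print(blocks_node_drain(1, min_available=1))
# Three replicas with minAvailable: 2 tolerate one eviction at a time.
print(allowed_disruptions(3, min_available=2))
```

A result of zero allowed disruptions means a node drain will wait on your Pod, which is the case worth raising with the platform team before the window.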

After upgrade, verify:

Routes still serve traffic
CronJobs and Jobs ran on schedule
Custom metrics and ServiceMonitors still scrape
Any webhook or admission integration still succeeds

If something breaks cluster-wide — console login, default ingress, monitoring stack — that is platform escalation territory. If only your namespace fails, start with your recent manifest changes and image tags, but mention the upgrade timestamp in the ticket.

Monitoring on OpenShift

OpenShift ships with cluster monitoring built on the Prometheus Operator. Application teams typically expose metrics through ServiceMonitor or PodMonitor objects, which only works if the platform has enabled monitoring for user-defined projects and your RBAC and namespace labels allow it.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: shop-api
  namespace: payments-prod
  labels:
    team: payments
spec:
  selector:
    matchLabels:
      app: shop-api
  endpoints:
    - port: metrics
      interval: "30s"
      path: "/metrics"

Platform teams often require specific labels so their Prometheus instance picks up your monitor. Ask for the label contract before merging YAML that never scrapes.

Useful habits:

Dashboards — use provided Grafana or your org’s central observability. Do not rely only on oc get pods.

Alerts — route application SLO alerts to your team; infrastructure alerts to platform. Mixing them creates alert fatigue and wrong escalations.

Console monitoring tab — helpful for quick checks; not a substitute for retention and query power in proper metrics storage.

openshift-state-metrics and kube-state-metrics — platform-level; still useful to understand Pod restart counts and quota pressure in your namespace.

When metrics disappear after an upgrade, check whether your ServiceMonitor labels still match the platform selector before assuming your app regressed.
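
Checking a label contract is mechanical enough to script. The sketch below evaluates a Kubernetes-style label selector (matchLabels plus matchExpressions) against the labels on a monitor. The platform selector shown is hypothetical; ask your platform team for the real one.

```python
def selector_matches(labels: dict, selector: dict) -> bool:
    """Evaluate a Kubernetes-style label selector against a set of labels."""
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key, operator = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if operator == "In":
            ok = labels.get(key) in values
        elif operator == "NotIn":
            # A missing key also satisfies NotIn.
            ok = labels.get(key) not in values
        elif operator == "Exists":
            ok = key in labels
        elif operator == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unsupported operator: {operator}")
        if not ok:
            return False
    return True


# Labels from the ServiceMonitor above, checked against a made-up contract.
monitor_labels = {"team": "payments"}
platform_selector = {"matchLabels": {"monitoring": "enabled"}}  # hypothetical
print(selector_matches(monitor_labels, platform_selector))
```

With the labels from the ServiceMonitor above and this made-up contract, the check fails, which is exactly the situation that produces a monitor that never scrapes.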

Logging and where application teams fit

OpenShift Cluster Logging (Loki-based in current generations, or legacy Elasticsearch stacks depending on version and install choices) aggregates platform and workload logs. Application teams usually:

Log to stdout/stderr — twelve-factor habit; cluster agents collect from the container runtime.

Structure logs — JSON or key=value pairs help search in Kibana, Grafana Loki, or your SIEM.

Avoid writing critical logs only to ephemeral container filesystem paths — lost on restart, painful in incidents.

Know retention — platform retention is not infinite; export audit-critical logs if compliance requires longer history.
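
The first two habits fit in a few lines of application code. Here is a minimal sketch using only the Python standard library: one JSON object per line on stdout. The field names (ts, level, logger, message) and the "fields" convention for extra context are my assumptions, not a schema that OpenShift logging requires.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured context passed as extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    handler = logging.StreamHandler(sys.stdout)  # stdout, not a file in the container
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate lines
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("shop-api")
log.info("payment authorized", extra={"fields": {"order_id": "A-1001"}})
```

Keeping context in separate keys, rather than baked into the message string, is what makes the logs searchable by field later.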

Search from CLI when the console is slow:

oc logs deployment/shop-api -n payments-prod --tail=200
oc logs deployment/shop-api -n payments-prod -c shop-api --since=1h

For multi-replica Deployments, oc logs deployment/... reads from only one Pod. Select by label to aggregate across replicas during incidents:

oc logs -l app=shop-api -n payments-prod --all-containers=true --tail=50 --prefix

If logs never appear in the central UI, escalate with namespace name, Pod label, and approximate timestamp — platform teams can trace collection pipeline gaps faster than you can guess.

Resource quotas and limits on shared clusters

Shared OpenShift clusters use ResourceQuota and LimitRange to cap namespace consumption. Platform teams define envelopes; application teams deploy inside them.

Typical ResourceQuota:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-prod-quota
  namespace: payments-prod
spec:
  hard:
    pods: "40"
    requests.cpu: "8"
    requests.memory: "16Gi"
    limits.cpu: "16"
    limits.memory: "32Gi"
    services.loadbalancers: "2"
    persistentvolumeclaims: "10"

LimitRange sets defaults, floors, and ceilings per Pod or Container:

apiVersion: v1
kind: LimitRange
metadata:
  name: payments-prod-limits
  namespace: payments-prod
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "4Gi"
      min:
        cpu: "50m"
        memory: "64Mi"

Check current usage before scaling or launching load tests:

oc describe quota -n payments-prod
oc describe limitrange -n payments-prod

Quota violations are rejected at admission, so the Pod is never created and nothing shows up in application logs. Look at the Events on the ReplicaSet or Deployment instead, and learn the message shape:

Error creating: pods "shop-api-7d9f-" is forbidden:
 exceeded quota: payments-prod-quota,
 requested: limits.cpu=4,
 used: limits.cpu=14,
 limited: limits.cpu=16

That is not Kubernetes being petty. It is the contract you signed by deploying into a shared namespace.
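
The arithmetic behind that message is easy to run before you scale. A minimal sketch, assuming quantities in the common Kubernetes notation (milli-units such as 500m, binary suffixes such as Gi); the helper names are hypothetical and the less common E and P suffixes are left out.

```python
from fractions import Fraction

# Suffixes used by Kubernetes quantities (E and P omitted for brevity).
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_quantity(text: str) -> Fraction:
    """Convert strings such as "500m", "2" or "16Gi" to an exact number."""
    text = text.strip()
    if text[-2:] in _BINARY:
        return Fraction(text[:-2]) * _BINARY[text[-2:]]
    if text[-1:] in _DECIMAL:
        return Fraction(text[:-1]) * _DECIMAL[text[-1:]]
    if text.endswith("m"):  # milli-units: 500m CPU is half a core
        return Fraction(text[:-1]) / 1000
    return Fraction(text)


def headroom(hard: str, used: str) -> Fraction:
    """Remaining quota for one resource."""
    return parse_quantity(hard) - parse_quantity(used)


def fits(hard: str, used: str, requested: str) -> bool:
    """True when the request fits into the remaining quota."""
    return parse_quantity(requested) <= headroom(hard, used)


# The rejected request from the message above: 4 requested, 14 used, 16 allowed.
print(fits("16", "14", "4"))   # False
print(headroom("16", "14"))    # 2
```

Feeding it the hard and used values from oc describe quota gives you the same answer the admission controller will, before the rollout rather than during it.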

Requests, limits, and honest capacity

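Every Container should declare requests and limits for CPU and memory at minimum. The scheduler places Pods by their requests. Quota counts requests and limits alike when the ResourceQuota defines both, as the one above does. Limits cap CPU burst and set the memory ceiling beyond which a container is OOM-killed.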

apiVersion: apps/v1
kind: Deployment
metadata:
  name: shop-api
  namespace: payments-prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: shop-api
  template:
    metadata:
      labels:
        app: shop-api
    spec:
      containers:
        - name: shop-api
          image: "ghcr.io/example/shop-api:2.4.1"
          ports:
            - containerPort: 8080
              name: http
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"

Under-requested CPU lets the scheduler overpack nodes, which creates noisy-neighbor contention; over-requested memory blocks others from scheduling even when your Pods sit idle. Review usage monthly against metrics, not guesses.
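
One way to turn that monthly review into a repeatable check is to compare a high percentile of observed usage with the declared request. The sketch below uses a nearest-rank 95th percentile and two thresholds (0.5 and 0.9) that are purely my assumptions; tune them to your own risk appetite, and export the samples from whatever metrics store your platform provides.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank - 1, 0)]


def review_request(samples: list, request: float,
                   low: float = 0.5, high: float = 0.9) -> str:
    """Classify a request by the ratio of p95 usage to the request.

    low and high are illustrative thresholds, not an OpenShift default.
    """
    ratio = percentile(samples, 95) / request
    if ratio < low:
        return "over-requested"
    if ratio > high:
        return "under-requested"
    return "ok"


# Hypothetical CPU samples in cores for a container requesting 250m (0.25).
print(review_request([0.05] * 100, request=0.25))
print(review_request([0.24] * 100, request=0.25))
```

The same function works for memory if you pass samples and the request in the same unit.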

Vertical Pod Autoscaler or platform recommendations may exist; adoption depends on org policy. Do not enable autoscaling that fights GitOps-managed replica counts without coordinating ignore rules.

Backup mindset for application teams

Backup on OpenShift spans layers. Clarify what your organization actually restores:

Cluster etcd and configuration — platform scope; restores the cluster skeleton, not necessarily your database contents.

PersistentVolume snapshots — may be automated by CSI drivers or backup operators; know retention and restore RPO/RTO.

Application data — databases, object storage, message queues — often still owned by the application team even when PVs exist.

Secrets and configuration in Git — GitOps helps rebuild manifests; Secrets need separate rotation and backup policy.

Operators and CRDs — restoring a namespace YAML without the operator that reconciles it produces silent drift.

Questions to ask platform or architecture once, in writing:

What is backed up automatically for our namespace?
What restore drills run annually?
Who initiates restore — app team or platform?
Are cross-region copies available for PVs we use?

Application team habits that help regardless:

Document stateful dependencies — which PVC names matter, which external SaaS holds authoritative data.

Test restore in staging — a backup nobody has restored is optimism.

Version migration paths — restore plus major DB upgrade may be one runbook.

Do not assume “the platform backs up OpenShift” equals “my service recovers tomorrow.”

Platform-owned versus team-owned scope

Rough division I have seen work:

Concern	Usually platform	Usually application team
Control plane upgrades	yes	awareness only
Nodes, machine pools	yes	report symptoms
Cluster Operators (logging, monitoring)	yes	configure app integration
Namespace, quota, LimitRange	yes	request changes
Deployment, Service, Route	no	yes
Application database backup	sometimes partial	yes
Ingress/Router defaults	yes	configure Route
SCC assignment to service accounts	often platform	request + justify
NetworkPolicy baseline	often platform	namespace rules inside policy

Gray areas exist. The table is a conversation starter with your platform team, not universal law.

When to escalate to the platform team

Escalate early when:

Symptoms are cluster-wide — multiple namespaces, console errors, default Routes failing.

Nodes NotReady or widespread scheduling failures — not a single Deployment misconfiguration.

Operator health degraded — ClusterOperator status not Available for core components.

Certificate or identity issues — OAuth, ingress certs, service serving certs affecting many services.

Quota or SCC changes required: you need higher limits or a custom SCC, and a namespace-scoped account cannot grant those to itself with oc.

Storage class or snapshot restore — PV provisioning fails for everyone in the class.

Network path outside your namespace — corporate firewall, DNS forwarders, egress proxies.

Suspected platform upgrade regression — timeline correlates with maintenance window.

Do not escalate empty-handed. Bring:

namespace and affected resource names
timestamps in UTC
oc describe output for failing Pods, Events, Routes
recent changes — merges, image tags, scale events
customer impact statement — error rate, failed transactions, internal only
what you already tried — rollbacks, log checks, replica scale
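
That list works well as a template that refuses to render while anything is blank. A small sketch in Python; the class and field names are hypothetical and simply mirror the list above, and the example values are invented.

```python
from dataclasses import dataclass, fields


@dataclass
class EscalationTicket:
    namespace: str
    resources: str
    timestamps_utc: str
    describe_output: str
    recent_changes: str
    customer_impact: str
    already_tried: str

    def missing(self) -> list:
        """Names of fields that are empty or whitespace only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Return the ticket body, or raise if any field is still blank."""
        gaps = self.missing()
        if gaps:
            raise ValueError("ticket is missing: " + ", ".join(gaps))
        return "\n".join(
            f"{f.name.replace('_', ' ')}: {getattr(self, f.name)}"
            for f in fields(self)
        )


# Invented example values for illustration only.
ticket = EscalationTicket(
    namespace="payments-prod",
    resources="deployment/shop-api",
    timestamps_utc="first failure seen 02:14 UTC",
    describe_output="FailedCreate events on the ReplicaSet (attached)",
    recent_changes="image tag 2.4.1 merged yesterday",
    customer_impact="checkout error rate elevated, external customers",
    already_tried="rolled back to previous tag, no change",
)
print(ticket.render())
```

Raising on blank fields is deliberate: the gap is cheaper to notice before the ticket is filed than in the first reply from platform.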

Example Event snippet worth pasting into a ticket (paraphrased here; the exact wording varies by OpenShift version):

Warning  FailedCreate  replicaset/shop-api-7d9f  admittance denied:
 container shop-api has runAsUser 1001,
 pod securityContext fsGroup is 1001,
 SCC restricted-v2 does not allow

That turns a vague “Pods pending” into an actionable SCC conversation.

Stay on the application side when one Deployment regressed after your merge, probes fail on a path you own, or logs show business logic errors — not kube Events. Platform teams respect teams that verify before escalating.

Runbooks and upgrade drills

Keep a short runbook per namespace: ownership, dependencies, quota headroom, rollback path, upgrade-window checks, and an escalation template with the fields platform asked for last time. Run the post-upgrade checklist after every maintenance, even when nothing looks broken.

Closing

OpenShift day-two operations for application teams is mostly boundaries: upgrade awareness without owning the control plane, observability through platform tooling with application instrumentation, capacity through quotas and honest requests, backup clarity instead of assumptions, and escalation with evidence instead of panic.

The platform exists to carry cluster weight. Application teams carry service reliability inside the envelope they were given. The collaboration works when both sides know which side of the line they are on — and when to walk to the bridge together.

If you run on OpenShift today, pick one production namespace and answer three questions without looking at notes: What is your quota headroom? Who do you page for a cluster upgrade failure? When did you last verify a backup restore path for your stateful tier? Gaps in those answers are day-two debt worth scheduling before the next maintenance window.