Controller pod down

Symptom
A stageset-controller pod is NotReady; the StageSetControllerPodDown alert
fires. While no replica is Ready, StageSets are not reconciled and the
Kubernetes admission webhook may reject StageSet
writes (failurePolicy: Fail).
Cause
- a crash-looping container (bad config flag, missing RBAC, panic),
- the node draining or out of resources,
- a failing readiness probe (
/readyzon--health-probe-bind-address), - the leader-election lease unobtainable.
Diagnosis
kubectl -n stageset-system get pods -l app.kubernetes.io/name=stageset-controller
kubectl -n stageset-system describe pod <pod>
kubectl -n stageset-system logs <pod> --previous --tail=200
Look for flag-parse errors at startup, RBAC Forbidden on the controller’s own
ServiceAccount, or OOMKills.
Remediation
- Fix the surfaced cause (correct the flag/values, grant the missing controller RBAC, raise resource limits).
- Run more than one replica with leader election so a single pod failure doesn’t stop reconciliation — see production.
- If admission is blocking writes during the outage and you must unblock urgently,
scope or relax the webhook
failurePolicy, then restore it once the controller is healthy.
See operations for the full alert set and its thresholds.