Webhook cert renewal failing

Symptom
stageset_webhook_cert_renewal_failures_total is increasing; the
StageSetWebhookCertRenewalFailing alert fires (see
operations for the alert set and its thresholds).
The current certificate keeps working until its natural expiry — that expiry is
the deadline, after which cluster-wide StageSet admission breaks.
Cause
Only applies in --webhook-cert-mode=self-signed. The in-pod renewer regenerates
the serving cert every validity/3 and patches the
ValidatingWebhookConfiguration’s caBundle. It fails when:
- the controller lost
update(orget) on the namedValidatingWebhookConfiguration(--webhook-validating-config-name), - the VWC was renamed and the flag/
resourceNamesweren’t updated, - the cert directory (
--webhook-cert-dir) became read-only.
In cert-manager mode this metric is irrelevant — cert-manager owns renewal.
Diagnosis
kubectl -n stageset-system logs deploy/stageset-controller | grep -i 'cert\|renew\|caBundle'
kubectl get validatingwebhookconfiguration <name> -o jsonpath='{.webhooks[*].clientConfig.caBundle}' | head -c 40
Remediation
- Restore
get/updateon the named VWC in the controller’s ClusterRole (resourceNamesmust include it). - Fix the
--webhook-validating-config-name/--webhook-cert-dirflags if they drifted from the deployed VWC and mount. - As a longer-term option, switch to
--webhook-cert-mode=cert-managerso renewal is handled by cert-manager.