Reconcile latency high

Symptom
controller_runtime_reconcile_time_seconds p99 for controller="stageset" exceeds
the configured threshold; the StageSetReconcileLatencyHigh alert fires (see
operations for the alert set and its thresholds).
Cause
A single reconcile does a lot of work — resolve and fetch every stage’s artifact,
kustomize-build, server-side apply, prune, verify readiness, and run actions — all
impersonating the tenant ServiceAccount. Latency climbs when any of those is slow:
- large artifacts or slow artifact servers,
- many objects per stage (apply + prune scale with object count),
- readiness waits and
wait/http/jobactions with long timeouts, - apiserver or tenant-authorization slowness.
Diagnosis
kubectl -n stageset-system logs deploy/stageset-controller --tail=200 | grep -i 'slow\|timeout\|took'
Break the latency down by stage count and artifact size; a single StageSet with many large stages dominates p99.
Remediation
- Split a very large StageSet into smaller ones, or fewer objects per stage.
- Tighten action
timeouts so a slow gate fails fast instead of stretching the reconcile. - Raise
spec.intervalwhere freshness isn’t critical. - Address upstream artifact-server or apiserver latency.
If the queue itself is backing up, see workqueue saturation.