Reconcile latency high see history edit this page

Talks about: , , , and

Symptom

controller_runtime_reconcile_time_seconds p99 for controller="stageset" exceeds the configured threshold; the StageSetReconcileLatencyHigh alert fires (see operations for the alert set and its thresholds).

Cause

A single reconcile does a lot of work — resolve and fetch every stage’s artifact, kustomize-build, server-side apply, prune, verify readiness, and run actions — all impersonating the tenant ServiceAccount. Latency climbs when any of those is slow:

Diagnosis

kubectl -n stageset-system logs deploy/stageset-controller --tail=200 | grep -i 'slow\|timeout\|took'

Break the latency down by stage count and artifact size; a single StageSet with many large stages dominates p99.

Remediation

If the queue itself is backing up, see workqueue saturation.