Operations

Talks about: alerts, metrics, operations, and runbooks

Metrics

The controller registers custom metrics on the controller-runtime registry, served on --metrics-bind-address (:8080) alongside the standard controller_runtime_* and workqueue_* series. Enable scraping with the chart’s opt-in ServiceMonitor (metrics.serviceMonitor.enabled):

# values.yaml
metrics:
  serviceMonitor:
    enabled: true        # needs the Prometheus operator CRDs

Metric	Type	Labels	Meaning
`stageset_reconcile_total`	counter	`namespace`, `name`, `reason`	Reconciles, by terminal Ready reason.
`stageset_stage_applied_total`	counter	`namespace`, `name`, `stage`	Stages applied and verified.
`stageset_drift_corrected_total`	counter	`namespace`, `name`, `stage`	Out-of-band drift re-asserted on a steady-state reconcile.
`stageset_update_deferred_total`	counter	`namespace`, `name`	Rollouts held by a closed update window.
`stageset_webhook_cert_renewal_failures_total`	counter	(none)	Failed self-signed webhook cert renewals.
`stageset_stage_ready`	gauge	`namespace`, `stageset`, `stage`	`1` when a stage is Ready, else `0` — for metric-based progressive delivery.

Alerts

The chart ships an opt-in PrometheusRule with a starter alert set, gated on metrics.prometheusRule.enabled (requires the Prometheus operator CRDs). It covers the custom stageset_* metrics plus controller-runtime signals:

Alert	Fires on	Severity
`StageSetReconcileErrorsHigh`	per-StageSet Ready=False rate (excludes the healthy `Succeeded`/`Suspended` reasons)	warning
`StageSetControllerWorkqueueDepthHigh`	the reconcile queue not draining	warning
`StageSetReconcileLatencyHigh`	reconcile p99 latency over threshold	warning
`StageSetControllerPodDown`	a controller pod NotReady	critical
`StageSetWebhookCertRenewalFailing`	self-signed cert rotation failing	critical

Every threshold is a knob under metrics.prometheusRule.thresholds, and extraAlertLabels is merged onto every rendered alert so all stageset alerts can route through one Alertmanager receiver. Each alert carries a runbook_url annotation pointing at the matching runbook page on this site (metrics.prometheusRule.runbookBaseURL); the reconcile-errors alert templates the URL on $labels.reason. Append your own rules under metrics.prometheusRule.extraRules, and silence a built-in alert by raising its threshold rather than forking the chart.

Events

The controller emits Kubernetes Events on every Ready-condition transition, so kubectl describe stageset <name> and Flux’s notification-controller (via an Alert targeting kind: StageSet) both surface what happened. Normal events include Succeeded, UpdateDeferred, MigrationStarted, and MigrationCompleted; warnings include StageFailed, DriftCorrected, RolledBack, MigrationFailed, OnFailureAction, and RollbackStoreFailed.

Runbooks

Every actionable Ready-condition reason has a runbook covering the symptom, cause, diagnosis, and remediation. Set --runbook-base-url (the chart’s controller.runbookBaseURL, which defaults to this docs site) to a published copy of those pages and the controller appends (runbook: <base>/<reason>/) to the Ready message (the reason lower-cased into a path segment), so a kubectl describe links straight to the fix. Healthy reasons (Succeeded, Suspended) get no link.

# values.yaml — point at your own mirror, or set "" to drop the links
controller:
  runbookBaseURL: https://runbooks.internal/stageset

For example, a StageFailed StageSet then shows:

Message:  stage "application" failed: … (runbook: https://runbooks.internal/stageset/stagefailed/)

Forcing a reconcile

The controller reconciles on its spec.interval, on source changes, and on demand. To trigger an out-of-band run, stamp the standard annotation — which is what flux reconcile and stagesetctl reconcile do for you:

kubectl annotate stageset my-app \
  reconcile.fluxcd.io/requestedAt="$(date -u +%FT%TZ)" --overwrite

The handled token is recorded in status.lastHandledReconcileAt.

Drift correction

On a steady-state reconcile the controller re-asserts the desired state, healing out-of-band changes to managed objects. Each correction emits a DriftCorrected event and increments stageset_drift_corrected_total. Tighten the cadence with spec.driftDetectionInterval when you need faster healing than spec.interval.