Operations

Metrics

The controller registers custom metrics on the controller-runtime registry, served on --metrics-bind-address (:8080) alongside the standard controller_runtime_* and workqueue_* series. Enable scraping with the chart’s opt-in ServiceMonitor (metrics.serviceMonitor.enabled):

# values.yaml
metrics:
  serviceMonitor:
    enabled: true        # needs the Prometheus operator CRDs

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| stageset_reconcile_total | counter | namespace, name, reason | Reconciles, by terminal Ready reason. |
| stageset_stage_applied_total | counter | namespace, name, stage | Stages applied and verified. |
| stageset_drift_corrected_total | counter | namespace, name, stage | Out-of-band drift re-asserted on a steady-state reconcile. |
| stageset_update_deferred_total | counter | namespace, name | Rollouts held by a closed update window. |
| stageset_webhook_cert_renewal_failures_total | counter | (none) | Failed self-signed webhook cert renewals. |
| stageset_stage_ready | gauge | namespace, stageset, stage | 1 when a stage is Ready, else 0; intended for metric-based progressive delivery. |
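
Before wiring up Prometheus, it can be useful to spot-check that the custom series are actually being served. The helper below is a minimal sketch that filters a Prometheus text exposition down to the stageset_* samples. The namespace and Deployment name in the usage comment are hypothetical, and the port assumes the :8080 bind address mentioned above.

```shell
# Keep only the controller's custom stageset_* samples from a
# Prometheus text exposition read on stdin (HELP/TYPE comments are dropped).
stageset_series() {
  grep -E '^stageset_[a-z_]+(\{| )'
}

# Usage (hypothetical namespace and Deployment name; adjust to your install):
#   kubectl -n stageset-system port-forward deploy/stageset-controller 8080:8080 &
#   curl -s http://localhost:8080/metrics | stageset_series
```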

Alerts

The chart ships an opt-in PrometheusRule with a starter alert set, gated on metrics.prometheusRule.enabled (requires the Prometheus operator CRDs). It covers the custom stageset_* metrics plus controller-runtime signals:

| Alert | Fires on | Severity |
| --- | --- | --- |
| StageSetReconcileErrorsHigh | per-StageSet Ready=False rate (excludes the healthy Succeeded/Suspended reasons) | warning |
| StageSetControllerWorkqueueDepthHigh | the reconcile queue not draining | warning |
| StageSetReconcileLatencyHigh | reconcile p99 latency over threshold | warning |
| StageSetControllerPodDown | a controller pod NotReady | critical |
| StageSetWebhookCertRenewalFailing | self-signed cert rotation failing | critical |

Every threshold is a knob under metrics.prometheusRule.thresholds, and extraAlertLabels is merged onto every rendered alert so all stageset alerts can route through one Alertmanager receiver. Each alert carries a runbook_url annotation pointing at the matching runbook page on this site (metrics.prometheusRule.runbookBaseURL); the reconcile-errors alert templates the URL on $labels.reason. Append your own rules under metrics.prometheusRule.extraRules, and silence a built-in alert by raising its threshold rather than forking the chart.
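
As a rough illustration, a values overlay that turns on the starter alerts and routes them with a shared label might be written like this. It uses only keys named above. The label key and value are hypothetical, extraAlertLabels is assumed to be a plain string map, and the threshold knobs are left at their defaults because their names are not listed here.

```shell
# Write a values overlay enabling the starter PrometheusRule.
# Assumptions: extraAlertLabels is a string map; "team: platform" is a made-up label.
cat > stageset-alerts-values.yaml <<'EOF'
metrics:
  prometheusRule:
    enabled: true              # needs the Prometheus operator CRDs
    extraAlertLabels:
      team: platform           # hypothetical routing label
    runbookBaseURL: https://runbooks.internal/stageset
EOF

# Apply with your usual Helm workflow (release and chart names are placeholders):
#   helm upgrade <release> <chart> -f stageset-alerts-values.yaml
```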

Events

The controller emits Kubernetes Events on every Ready-condition transition, so kubectl describe stageset <name> and Flux’s notification-controller (via an Alert targeting kind: StageSet) both surface what happened. Normal events include Succeeded, UpdateDeferred, MigrationStarted, and MigrationCompleted; warnings include StageFailed, DriftCorrected, RolledBack, MigrationFailed, OnFailureAction, and RollbackStoreFailed.
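
To pull just the warnings for one StageSet without reading the full describe output, a field selector on the Events API is usually enough. This is a sketch using standard kubectl event field selectors; the function name is illustrative.

```shell
# List Warning events recorded against one StageSet, oldest first.
# $1 = namespace, $2 = StageSet name.
stageset_warnings() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.kind=StageSet,involvedObject.name=$2,type=Warning" \
    --sort-by=.metadata.creationTimestamp
}

# Usage:
#   stageset_warnings default my-app
```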

Runbooks

Every actionable Ready-condition reason has a runbook covering the symptom, cause, diagnosis, and remediation. Point --runbook-base-url (the chart’s controller.runbookBaseURL, which defaults to this docs site) at a published copy of those pages, and the controller appends (runbook: <base>/<reason>/) to the Ready message, with the reason lower-cased into a path segment, so kubectl describe output links straight to the fix. Healthy reasons (Succeeded, Suspended) get no link.

# values.yaml — point at your own mirror, or set "" to drop the links
controller:
  runbookBaseURL: https://runbooks.internal/stageset

For example, a StageFailed StageSet then shows:

Message:  stage "application" failed: … (runbook: https://runbooks.internal/stageset/stagefailed/)
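
The link format described above can be reproduced in a few lines, which is handy when checking that a mirror serves every path the controller will emit. This is a sketch of the described behaviour, not the controller's own code; tolerating a trailing slash on the base URL is an assumption.

```shell
# Build the runbook link for a Ready reason: <base>/<lower-cased reason>/.
# Healthy reasons and an empty base URL produce no link.
runbook_url() {
  local base reason
  base=${1%/}          # assumption: tolerate one trailing slash on the base
  reason=$2
  case "$reason" in
    Succeeded|Suspended) return 0 ;;
  esac
  [ -n "$base" ] || return 0
  printf '%s/%s/\n' "$base" "$(printf '%s' "$reason" | tr '[:upper:]' '[:lower:]')"
}

runbook_url https://runbooks.internal/stageset StageFailed
# prints: https://runbooks.internal/stageset/stagefailed/
```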

Forcing a reconcile

The controller reconciles on its spec.interval, on source changes, and on demand. To trigger an out-of-band run, stamp the standard annotation — which is what flux reconcile and stagesetctl reconcile do for you:

kubectl annotate stageset my-app \
  reconcile.fluxcd.io/requestedAt="$(date -u +%FT%TZ)" --overwrite

The handled token is recorded in status.lastHandledReconcileAt.
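
In a script, the two halves can be combined: stamp the annotation, then poll the status field until it echoes the same token. The sketch below assumes the controller copies the annotation value verbatim into status.lastHandledReconcileAt; the function name, poll interval, and retry count are illustrative.

```shell
# Stamp the reconcile annotation and wait until the controller reports
# the same token as handled.
# $1 = StageSet name, $2 = namespace (default "default"), $3 = max polls (default 30).
reconcile_and_wait() {
  local name ns tries token handled
  name=$1
  ns=${2:-default}
  tries=${3:-30}
  token=$(date -u +%FT%TZ)
  kubectl -n "$ns" annotate stageset "$name" \
    "reconcile.fluxcd.io/requestedAt=$token" --overwrite || return 1
  while [ "$tries" -gt 0 ]; do
    handled=$(kubectl -n "$ns" get stageset "$name" \
      -o jsonpath='{.status.lastHandledReconcileAt}')
    [ "$handled" = "$token" ] && return 0
    tries=$((tries - 1))
    sleep 2
  done
  return 1   # token was not observed within the allowed polls
}

# Usage:
#   reconcile_and_wait my-app default
```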

Drift correction

On a steady-state reconcile the controller re-asserts the desired state, healing out-of-band changes to managed objects. Each correction emits a DriftCorrected event and increments stageset_drift_corrected_total. Tighten the cadence with spec.driftDetectionInterval when you need faster healing than spec.interval.
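
As a sketch, the cadence can be tightened on a single StageSet with a merge patch, and the resulting corrections listed by event reason. The duration format ("1m") is an assumption about how spec.driftDetectionInterval is expressed, and the function names are illustrative.

```shell
# Set a tighter drift-detection cadence on one StageSet.
# $1 = name, $2 = interval (a duration string such as "1m" is assumed),
# $3 = namespace (default "default").
set_drift_interval() {
  kubectl -n "${3:-default}" patch stageset "$1" --type merge \
    -p "{\"spec\":{\"driftDetectionInterval\":\"$2\"}}"
}

# List the DriftCorrected events recorded in a namespace.
# $1 = namespace (default "default").
drift_events() {
  kubectl -n "${1:-default}" get events --field-selector reason=DriftCorrected
}

# Usage:
#   set_drift_interval my-app 1m default
#   drift_events default
```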