Observability see history edit this page

Talks about: , , , and

The controller exports Prometheus metrics on an HTTP endpoint: a set of custom stageset_* series describing reconcile outcomes, stage applies, drift correction, deferred rollouts, webhook-cert renewals, and per-stage readiness — alongside the controller-runtime and workqueue series every controller-runtime manager exposes for free. The Helm chart wires a metrics Service, an opt-in ServiceMonitor for scraping, and an opt-in PrometheusRule with a starter alert set whose thresholds are tunable values.

The metrics endpoint

The controller binds its metrics endpoint via --metrics-bind-address, defaulting to :8080. Setting it to 0 disables the endpoint.

The chart binds the same port through ports.metrics (default 8080) and renders a ClusterIP Service named <release>-metrics exposing a metrics port that targets the container’s metrics port. The Service is always present, so any scrape mechanism has a stable target.

Scrape with a ServiceMonitor

When the Prometheus operator is installed, opt into a ServiceMonitor that selects the metrics Service:

metrics:
  serviceMonitor:
    enabled: true
    interval: 30s
    # Extra labels so your Prometheus instance's serviceMonitorSelector picks it up.
    additionalLabels:
      release: kube-prometheus-stack

Scrape without the Prometheus operator

Without the operator’s CRDs, point a plain Prometheus scrape config at the metrics Service, or annotate the rendered Service for an annotation-driven scrape discovery setup:

kubectl annotate service <release>-metrics \
  prometheus.io/scrape=true \
  prometheus.io/port=8080

Metrics reference

Every custom metric is registered against controller-runtime’s registry, so it rides the same --metrics-bind-address endpoint as the built-in series. All counters and the readiness gauge are labelled by namespace and the StageSet name so a single StageSet’s behaviour is isolatable.

Reconcile outcomes

MetricTypeLabelsMeaning
stageset_reconcile_totalCounternamespace, name, reasonReconciles by their terminal Ready-condition reason. The reason label is one of the wire-stable Ready reasons.
stageset_update_deferred_totalCounternamespace, nameReconciles that held a rollout because an update window was closed.

Stage progress

MetricTypeLabelsMeaning
stageset_stage_applied_totalCounternamespace, name, stageStages that applied and verified successfully.
stageset_drift_corrected_totalCounternamespace, name, stageObjects whose out-of-band drift the apply corrected on a steady-state reconcile (same revision as last applied).
stageset_stage_readyGaugenamespace, stageset, stageWhether a stage is Ready (1) or not (0) at the current observed state. A metrics-gated progressive-delivery controller — Argo Rollouts’ Prometheus metric provider, for example — can hold a rollout directly on this gauge. The series is deleted when a StageSet or one of its stages goes away, so a removed stage leaves no stale gauge.

Webhook

MetricTypeLabelsMeaning
stageset_webhook_cert_renewal_failures_totalCounterFailed self-signed webhook certificate renewals in the background renewer goroutine. Meaningful only in self-signed webhook mode.

Controller-runtime and workqueue metrics

The manager also exports the standard controller-runtime series without any extra configuration, all carrying controller="stageset":

The shipped alerts build on both the custom stageset_* series and these controller-runtime signals.

Alerts

The chart ships an opt-in PrometheusRule named <release>-alerts with a starter alert set. It requires the Prometheus operator CRDs. Enable it and route every alert to one Alertmanager receiver with extraAlertLabels:

metrics:
  prometheusRule:
    enabled: true
    interval: 30s
    # Labels Prometheus selects the rule object on.
    labels:
      release: kube-prometheus-stack
    # Merged onto every rendered alert, so all stageset alerts route together.
    extraAlertLabels:
      team: platform

Shipped alerts

AlertSeverityFires whenThreshold knobs
StageSetReconcileErrorsHighwarningPer-StageSet Ready=False rate (5m window) exceeds the rate, excluding the happy Succeeded and Suspended reasons.reconcileErrorRate (0.1/s), reconcileErrorDuration (10m)
StageSetControllerWorkqueueDepthHighwarningworkqueue_depth{controller="stageset"} exceeds the depth.workqueueDepth (50), workqueueDuration (15m)
StageSetReconcileLatencyHighwarningReconcile p99 (10m window) exceeds the ceiling in seconds.reconcileLatencySeconds (30), reconcileLatencyDuration (15m)
StageSetControllerPodDowncriticalA controller pod is NotReady for the window.podDownDuration (5m)
StageSetWebhookCertRenewalFailingcriticalstageset_webhook_cert_renewal_failures_total increases over 1h beyond the count.webhookCertRenewalFailuresPerHour (1), webhookCertRenewalFailuresDuration (30m)

Each threshold lives under metrics.prometheusRule.thresholds. To silence a built-in alert without forking the chart, raise its threshold to an impossibly high value; to add alerts, append them under metrics.prometheusRule.extraRules, which renders verbatim in a separate stageset-extras group.

metrics:
  prometheusRule:
    thresholds:
      reconcileErrorRate: 0.1
      reconcileErrorDuration: 10m
      workqueueDepth: 50
      workqueueDuration: 15m
      reconcileLatencySeconds: 30
      reconcileLatencyDuration: 15m
      podDownDuration: 5m
      webhookCertRenewalFailuresPerHour: 1
      webhookCertRenewalFailuresDuration: 30m

Runbooks

Each alert carries a runbook link in its runbook_url annotation (the annotation key is metrics.prometheusRule.runbookAnnotationKey, the URL prefix is metrics.prometheusRule.runbookBaseURL, defaulting to the documentation site’s runbooks). StageSetReconcileErrorsHigh templates its link on the reason label — every Ready-condition reason maps to a runbook at /runbooks/<reason>/ — while the availability and webhook alerts point at their fixed pages (workqueue-saturation, reconcile-latency, controller-pod-down, webhook-cert-renewal).

The same reason-to-page mapping surfaces on the resource itself: the controller’s --runbook-base-url flag threads a (runbook: <base>/<reason>/) suffix onto actionable Ready-condition messages, so kubectl describe stageset links straight to the remediation page for an actionable failure. Intentional and steady states — Succeeded, Suspended — carry no suffix, since there is nothing to remediate.