# StageSet Controller — full documentation

> The complete StageSet Controller documentation (https://stageset.projects.metio.wtf/) concatenated for
> LLMs. For a concise link index see https://stageset.projects.metio.wtf/llms.txt.

# StageSet Controller

`stageset-controller` is a [Flux](https://fluxcd.io/) controller for ordered, gated, multi-stage delivery.

Flux's `kustomize-controller` and `helm-controller` apply an artifact in one shot. That fits most releases, but not one that has to happen in sequence: install the CRDs before the operator that needs them, run a database migration before the app that reads the new schema, hold a production rollout until the canary is healthy, freeze changes during business hours.

A `StageSet` describes a release as an ordered list of stages. Each stage applies a Flux source — a `GitRepository`, `OCIRepository`, `Bucket`, or an [`ExternalArtifact`](https://fluxcd.io/flux/components/source/externalartifacts/) (including one rendered on the fly by a producer like [JaaS](https://jaas.projects.metio.wtf/)) — waits for it to become healthy, and only then lets the next stage begin. Between stages, run typed actions (a migration `Job`, an HTTP gate, a wait-for-condition), gate rollouts behind [update windows](/usage/update-windows/), and run version-aware [migrations](/usage/versioned-migrations/) when you cross a release boundary. Everything is reconciled continuously, drift-corrected, and pruned with ApplySet semantics.

## What a StageSet looks like

The smallest useful StageSet is one stage pointing at one artifact:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: my-app
  namespace: default
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app  # an ExternalArtifact in this namespace
```

The same shape scales up to a gated rollout:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: payments
  namespace: payments
spec:
  serviceAccountName: payments-deployer  # every apply is impersonated as this SA
  stages:
    # 1 ── shared infrastructure: CRDs, namespaces, RBAC
    - name: infrastructure
      sourceRef:
        name: payments-infra  # an ExternalArtifact
      readyChecks:
        checks:
          - apiVersion: apiextensions.k8s.io/v1
            kind: CustomResourceDefinition
            name: ledgers.payments.example
    # 2 ── the application, started only once infrastructure is Ready
    - name: application
      sourceRef:
        name: payments-app
      actions:
        pre:
          - name: db-migrate  # runs before the manifests are applied
            job:
              sourceRef:
                name: payments-migrations
        post:
          - name: smoke-test  # stage is Ready only if this passes
            http:
              url: https://payments.internal/healthz
              expectedStatus: [200]
  # new revisions roll out only outside the Friday-evening change freeze
  updateWindows:
    - type: Deny
      schedule: "0 17 * * FRI"
      duration: 60h
      timeZone: Europe/Berlin
```

Stages run top to bottom. `infrastructure` must report Ready (its CRD established) before `application` is touched; the migration Job runs before the app is applied; the rollout is held when the change-freeze window is open. Everything is continuously reconciled — drift is corrected, removed objects are pruned.

## Where to go next

- **[Installation](/installation/)** — install on Kubernetes, then harden for production and wire up observability.
- **[Usage](/usage/)** — worked examples for every feature, from a single stage to versioned migrations.
- **[CLI](/cli/)** — `stagesetctl` for previewing (`diff`), rendering (`build`), and driving (`reconcile`) StageSets.
- **[API reference](/api/)** — every field of every custom resource, explained.
- **[Comparisons](/comparisons/)** — how StageSet relates to Helm, Kustomize, Tanka, kubecfg, and plain Flux.
- **[Runbooks](/runbooks/)** — symptom → cause → remediation for every status reason.

## Related projects

`stageset-controller` handles the delivery end and composes with two adjacent projects, each useful on its own:

- **[JOI](https://github.com/metio/jsonnet-oci-images)** publishes Jsonnet libraries as single-layer OCI images.
- **[JaaS](https://jaas.projects.metio.wtf/)** evaluates Jsonnet on demand and publishes the result as a Flux `ExternalArtifact`.
- `stageset-controller` takes those artifacts and rolls them out, in order, with gates.

JOI and JaaS are not required — a stage reads straight from a `GitRepository`, `OCIRepository`, or `Bucket`, or from any `ExternalArtifact`, whatever produced it.

---

# Installation

Source: https://stageset.projects.metio.wtf/installation/

Get `stageset-controller` running on a [Kubernetes](https://kubernetes.io/docs/) cluster, then keep it healthy in [production](/installation/production/).

---

# Configuration reference

Source: https://stageset.projects.metio.wtf/installation/configuration/

The controller is configured entirely through command-line flags, grouped below by subsystem. When deployed via the Helm chart you never pass these directly — the chart sets them from your values and its own defaults; each section notes the Helm value that drives a flag, and the [metio/helm-charts](https://github.com/metio/helm-charts/tree/main/charts/stageset-controller) repo carries the full values reference.

For the Helm values worth tuning and the reasoning behind each, see [Production](/installation/production/#settings-you-can-tune); for metrics and runbooks, [Operations](/installation/operations/).
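
To make the flag-to-value relationship concrete, here is a minimal sketch of a values file that drives a few of the flags documented in the tables that follow. The value names are the ones listed in those tables; the namespace names and the host glob are hypothetical, and the list shape of `watchNamespaces` and `allowedActionHosts` is inferred from their documented `[]` defaults rather than taken from the chart's `values.yaml`.

```yaml
# values.yaml (illustrative sketch; namespaces and host glob are made up)
controller:
  leaderElect: true             # drives --leader-elect
  defaultInterval: 5m           # drives --default-interval
  noCrossNamespaceRefs: true    # drives --no-cross-namespace-refs
  watchNamespaces:              # drives --watch-namespaces
    - team-a
    - team-b
  allowedActionHosts:           # drives --allowed-action-hosts (repeatable flag)
    - "*.internal.example"
```

Check the chart's own `values.yaml` for the authoritative structure before copying this.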

## Manager and leader election

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--health-probe-bind-address` | `:8081` | Address the liveness and readiness probe endpoints bind to. | _chart-managed_ |
| `--leader-elect` | `false` | Enable controller-runtime leader election so only one replica reconciles at a time. Recommended for HA deployments. | `controller.leaderElect` |

The leader-election lease name is fixed at `stageset-controller.stages.metio.wtf` and is created in the namespace the controller pod runs in.

## Watch scope

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--watch-namespaces` | _(empty)_ | Comma-separated list of namespaces the controller watches. Empty (the default) means cluster-wide. When set, the manager's cache only observes StageSets and sources in these namespaces — the multi-tenant controller-instances pattern. Falls back to the `STAGESET_WATCH_NAMESPACES` environment variable when the flag is empty. | `controller.watchNamespaces` |

**Environment variable:** `STAGESET_WATCH_NAMESPACES` — comma-separated namespace list. When `--watch-namespaces` is non-empty the flag takes precedence.

When restricted, the chart pivots RBAC to per-namespace RoleBindings instead of a cluster-wide ClusterRoleBinding.

## Reconciliation defaults

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--default-interval` | `10m` | Reconcile cadence for StageSets that omit `spec.interval`. | `controller.defaultInterval` |
| `--inventory-mode` | `hybrid` | Inventory strategy for tracking applied resources: `entries`, `hybrid`, or `applyset`. | `controller.inventoryMode` |
| `--inventory-shard-cap` | `5000` | Maximum number of resource entries per `StageInventory` shard. | `controller.inventoryShardCap` |
| `--no-cross-namespace-refs` | `false` | Deny `sourceRef` and `dependsOn` references that target a different namespace. | `controller.noCrossNamespaceRefs` |
| `--allowed-action-hosts` | _(empty)_ | Host glob allowed for `http` actions; repeatable. Loopback and link-local ranges are always denied unless explicitly listed. | `controller.allowedActionHosts` |
| `--runbook-base-url` | _(empty)_ | URL prefix appended to actionable Ready condition messages as `(runbook: //)`. Empty disables. | `controller.runbookBaseURL` |

## Rollback store — filesystem

The rollback store preserves a copy of each stage's last-applied artifact so that a rollback can re-apply the previous revision without re-fetching from the producer. The filesystem backend is appropriate for single-replica deployments or multi-replica deployments backed by an `RWX` volume.

`--rollback-store-path` and `--rollback-store-s3-endpoint` are mutually exclusive. Both empty disables the store; rollback falls back to re-fetching the producer artifact.

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--rollback-store-path` | _(empty)_ | Filesystem directory (e.g. an RWX PVC mount) for the rollback store. Empty disables the filesystem backend. | `rollbackStore.backend: pvc` |

The file store writes rendered output — including Secret data — in the clear. The volume must provide encryption at rest (encrypted StorageClass, LUKS, or cloud-disk encryption).

## Rollback store — S3

Active when `--rollback-store-s3-endpoint` and `--rollback-store-s3-bucket` are both non-empty.

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--rollback-store-s3-endpoint` | _(empty)_ | S3-compatible endpoint (`host:port`, e.g. `s3.amazonaws.com` or `minio.minio.svc:9000`). Empty disables the S3 backend. | `rollbackStore.s3.endpoint` |
| `--rollback-store-s3-bucket` | _(empty)_ | S3 bucket for the rollback store. Must already exist. | `rollbackStore.s3.bucket` |
| `--rollback-store-s3-prefix` | _(empty)_ | Optional object-key prefix so the rollback store can coexist with other tenants in one bucket. | `rollbackStore.s3.prefix` |
| `--rollback-store-s3-region` | _(empty)_ | S3 region. Required for AWS multi-region buckets; ignored by most S3-compatible servers. | `rollbackStore.s3.region` |
| `--rollback-store-s3-use-ssl` | `true` | Use HTTPS to talk to the S3 endpoint. Set to `false` only for local MinIO over plain HTTP. | `rollbackStore.s3.useSSL` |
| `--rollback-store-s3-access-key` | _(empty)_ | Static access key. Empty engages minio-go's IAM/IRSA credential discovery chain (env → web-identity → EC2/EKS metadata). | `rollbackStore.s3.existingSecret` |
| `--rollback-store-s3-secret-key` | _(empty)_ | Secret key, paired with `--rollback-store-s3-access-key`. | `rollbackStore.s3.existingSecret` |
| `--rollback-store-s3-session-token` | _(empty)_ | Optional session token for temporary credentials (e.g. IRSA). | `rollbackStore.s3.existingSecret` |
| `--rollback-store-s3-anonymous` | `false` | Skip request signing. For public buckets only. | `rollbackStore.s3.anonymous` |
| `--rollback-store-s3-sse` | `s3` | Server-side encryption for stored objects: `none`, `s3` (SSE-S3), or `kms` (SSE-KMS). The store holds rendered Secret data, so encryption is on by default. Set `none` only for a bucket whose backend cannot honor an SSE header. | `rollbackStore.s3.sse` |
| `--rollback-store-s3-sse-kms-key` | _(empty)_ | KMS key ARN or ID for `--rollback-store-s3-sse=kms`. Empty uses the bucket's default KMS key. | `rollbackStore.s3.sseKmsKeyId` |

## Metrics and health

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--metrics-bind-address` | `:8080` | Address the controller-runtime Prometheus metrics endpoint binds to. `"0"` disables. | _chart-managed_ |

The metrics endpoint exposes standard `controller_runtime_*` and `workqueue_*` series alongside the custom `stageset_*` metrics documented in [Operations](/installation/operations/).

## Webhook and TLS provisioning

The validating admission webhook for `StageSet` is enabled by default.
Two TLS provisioning modes are supported.

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--enable-webhook` | `true` | Enable the validating admission webhook for `StageSet`. | _chart-managed_ |
| `--webhook-cert-mode` | `cert-manager` | TLS provisioning mode: `cert-manager` (chart renders a `Certificate` CR; cert is mounted from a Secret) or `self-signed` (the controller generates a CA and serving cert in-pod and patches the `ValidatingWebhookConfiguration` `caBundle`). | `webhook.certMode` |
| `--webhook-cert-dir` | `/tmp/k8s-webhook-server/serving-certs` | Directory holding `tls.crt` and `tls.key` for the webhook server. | _chart-managed_ |
| `--webhook-port` | `9443` | Port the validating webhook server binds to. | _chart-managed_ |
| `--webhook-cert-validity` | `8760h` (1 year) | Validity of the self-signed serving cert. The controller rotates it every `validity/3`. | `webhook.*` |
| `--webhook-service-name` | `stageset-controller-webhook` | Kubernetes Service the webhook is reachable through. Used to build cert SANs in `self-signed` mode. | _chart-managed_ |
| `--webhook-service-namespace` | _(empty)_ | Namespace of the webhook Service. Empty falls back to the in-cluster ServiceAccount namespace. | _chart-managed_ |
| `--webhook-validating-config-name` | _(empty)_ | Name of the `ValidatingWebhookConfiguration` whose `caBundle` the controller patches. Required when `--webhook-cert-mode=self-signed`. | _chart-managed_ |

## Gate endpoint

The gate endpoint exposes a read-only HTTP API for Flagger canary stage-gates. `GET /gate/{namespace}/{stageset}/{stage}` returns `200` when the named stage is ready to advance and `503` otherwise.

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--gate-bind-address` | `:8082` | Address for the Flagger stage-gate endpoint. Empty disables the endpoint. | `gate.enabled` |

## Logging

Logging is powered by the controller-runtime `zap` logger. The standard zap flags (`--zap-log-level`, `--zap-encoder`, `--zap-stacktrace-level`, `--zap-time-encoding`, and `--zap-devel`) are available and bound to `flag.CommandLine`; run `stageset-controller --help` to see their current defaults.

---

# Install on Kubernetes

Source: https://stageset.projects.metio.wtf/installation/kubernetes/

## Prerequisites

- A [Kubernetes](https://kubernetes.io/docs/) cluster with `kubectl` and [`helm`](https://helm.sh/) configured against it.
- [Flux](https://fluxcd.io/) `source-controller`, specifically the `ExternalArtifact` API (`source.toolkit.fluxcd.io`). A `StageSet` stage always resolves to an `ExternalArtifact`, so the CRD must exist. `ExternalArtifact` lands in Flux **v2.7.0**; install at least that version. The controller also watches `GitRepository`, `OCIRepository`, and `Bucket` sources for producer-aware resolution.
- [cert-manager](https://cert-manager.io/), only if you choose the `cert-manager` webhook certificate mode. The chart defaults to `self-signed`, which provisions and rotates the admission webhook's TLS in-process and needs no cert-manager. See [production](/installation/production/#admission-webhook-tls) for the trade-off.

[JaaS](https://jaas.projects.metio.wtf/), JOI, or any particular artifact producer are not required to install the controller — those are sources of `ExternalArtifact`s, wired up per `StageSet`.

## Install with Helm

The controller is distributed as an OCI [Helm](https://helm.sh/) chart. The deployment manifests live in the chart, not in the controller repository.

```shell
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace
```

The container image is `ghcr.io/metio/stageset-controller`; the chart pins the tag to its own `appVersion` by default.

Every setting referenced across these docs — HA replicas, the rollback store, webhook mode, NetworkPolicy, the ServiceMonitor, and the rest — is a Helm value.
The [chart's README and `values.yaml`](https://github.com/metio/helm-charts/tree/main/charts/stageset-controller) document the full, current list.

### What the chart installs

- The **controller `Deployment`**, its `ServiceAccount`, and the cluster RBAC it needs (a `ClusterRole` + `ClusterRoleBinding`, plus a namespaced leader-election `Role`/`RoleBinding`).
- The **CRDs** — `StageSet` and `StageInventory`.
- The **validating admission webhook** (`ValidatingWebhookConfiguration` + a webhook `Service`).
- A **metrics `Service`** (and an opt-in `ServiceMonitor`).
- The **Flagger gate `Service`** for the read-only stage-gate endpoint.
- Opt-in extras: `NetworkPolicy`, `PodDisruptionBudget`, `HorizontalPodAutoscaler`, a rollback-store `PersistentVolumeClaim`, and a managed `Namespace`.

### About the CRDs

The CRDs ship inside the chart's regular templates (not Helm's special `crds/` directory), so a `helm upgrade` applies schema changes like any other resource. This is governed by `crds.create` (default `true`). The CRDs carry `helm.sh/resource-policy: keep`, so a `helm uninstall` leaves them — and your StageSets — in place; remove them by hand if you really mean to.

If you manage CRDs out of band, the raw definitions are also published in the controller repository under `config/crd/` and can be applied with `kubectl apply --server-side -f`.

## Verify

```shell
kubectl -n stageset-system get deploy stageset-controller
kubectl get crd stagesets.stages.metio.wtf stageinventories.stages.metio.wtf
```

Once the controller is `Available`, create your first [StageSet](/usage/stages-and-sources/).

---

# Operations

Source: https://stageset.projects.metio.wtf/installation/operations/

## Metrics

The controller registers custom metrics on the controller-runtime registry, served on `--metrics-bind-address` (`:8080`) alongside the standard `controller_runtime_*` and `workqueue_*` series.
Enable scraping with the chart's opt-in `ServiceMonitor` (`metrics.serviceMonitor.enabled`):

```yaml
# values.yaml
metrics:
  serviceMonitor:
    enabled: true  # needs the Prometheus operator CRDs
```

| Metric | Type | Labels | Meaning |
|---|---|---|---|
| `stageset_reconcile_total` | counter | `namespace`, `name`, `reason` | Reconciles, by terminal Ready reason. |
| `stageset_stage_applied_total` | counter | `namespace`, `name`, `stage` | Stages applied and verified. |
| `stageset_drift_corrected_total` | counter | `namespace`, `name`, `stage` | Out-of-band drift re-asserted on a steady-state reconcile. |
| `stageset_update_deferred_total` | counter | `namespace`, `name` | Rollouts held by a closed update window. |
| `stageset_webhook_cert_renewal_failures_total` | counter | _(none)_ | Failed self-signed webhook cert renewals. |
| `stageset_stage_ready` | gauge | `namespace`, `stageset`, `stage` | `1` when a stage is Ready, else `0` — for metric-based [progressive delivery](/tutorials/progressive-delivery/#argo-rollouts). |

## Alerts

The chart ships an opt-in `PrometheusRule` with a starter alert set, gated on `metrics.prometheusRule.enabled` (requires the [Prometheus operator](https://prometheus-operator.dev/) CRDs). It covers the custom `stageset_*` metrics plus controller-runtime signals:

| Alert | Fires on | Severity |
|---|---|---|
| `StageSetReconcileErrorsHigh` | per-StageSet Ready=False rate (excludes the healthy `Succeeded`/`Suspended` reasons) | warning |
| `StageSetControllerWorkqueueDepthHigh` | the reconcile queue not draining | warning |
| `StageSetReconcileLatencyHigh` | reconcile p99 latency over threshold | warning |
| `StageSetControllerPodDown` | a controller pod NotReady | critical |
| `StageSetWebhookCertRenewalFailing` | self-signed cert rotation failing | critical |

Every threshold is a knob under `metrics.prometheusRule.thresholds`, and `extraAlertLabels` is merged onto every rendered alert so all stageset alerts can route through one Alertmanager receiver. Each alert carries a `runbook_url` annotation pointing at the matching [runbook](/runbooks/) page on this site (`metrics.prometheusRule.runbookBaseURL`); the reconcile-errors alert templates the URL on `$labels.reason`. Append your own rules under `metrics.prometheusRule.extraRules`, and silence a built-in alert by raising its threshold rather than forking the chart.

## Events

The controller emits Kubernetes Events on every Ready-condition transition, so `kubectl describe stageset ` and [Flux](https://fluxcd.io/)'s `notification-controller` (via an `Alert` targeting `kind: StageSet`) both surface what happened. Normal events include `Succeeded`, `UpdateDeferred`, `MigrationStarted`, and `MigrationCompleted`; warnings include `StageFailed`, `DriftCorrected`, `RolledBack`, `MigrationFailed`, `OnFailureAction`, and `RollbackStoreFailed`.

## Runbooks

Every actionable Ready-condition reason has a [runbook](/runbooks/) covering the symptom, cause, diagnosis, and remediation.
Set `--runbook-base-url` (the chart's `controller.runbookBaseURL`, which defaults to this docs site) to a published copy of those pages and the controller appends `(runbook: //)` to the Ready message (the reason lower-cased into a path segment), so a `kubectl describe` links straight to the fix. Healthy reasons (`Succeeded`, `Suspended`) get no link.

```yaml
# values.yaml — point at your own mirror, or set "" to drop the links
controller:
  runbookBaseURL: https://runbooks.internal/stageset
```

For example, a `StageFailed` StageSet then shows:

```text
Message: stage "application" failed: … (runbook: https://runbooks.internal/stageset/stagefailed/)
```

## Forcing a reconcile

The controller reconciles on its `spec.interval`, on source changes, and on demand. To trigger an out-of-band run, stamp the standard annotation — which is what `flux reconcile` and [`stagesetctl reconcile`](/cli/reconcile/) do for you:

```shell
kubectl annotate stageset my-app \
  reconcile.fluxcd.io/requestedAt="$(date -u +%FT%TZ)" --overwrite
```

The handled token is recorded in `status.lastHandledReconcileAt`.

## Drift correction

On a steady-state reconcile the controller re-asserts the desired state, healing out-of-band changes to managed objects. Each correction emits a `DriftCorrected` event and increments `stageset_drift_corrected_total`. Tighten the cadence with `spec.driftDetectionInterval` when you need faster healing than `spec.interval`.

---

# Production

Source: https://stageset.projects.metio.wtf/installation/production/

## High availability

The controller supports leader-elected HA. Enable leader election and run more than one replica; only the lease holder reconciles, while every replica answers admission webhook calls (admission must stay available even on non-leaders).

- Leader election is toggled with `--leader-elect`. The binary defaults it to `false`, but the **Helm chart enables it by default** (`controller.leaderElect: true`), so a default install is already lease-guarded even at one replica.
- The lease is named `stageset-controller.stages.metio.wtf` and lives in the controller's namespace. It uses controller-runtime's default timing (~15 s lease duration). The lease is **not** released eagerly on shutdown, so after a rolling update the new leader takes over when the old lease expires — budget a few seconds of reconcile pause on restart (admission and the gate endpoint are unaffected).
- Scaling: when the chart's `replicas.max` exceeds `replicas.min` it renders a `HorizontalPodAutoscaler` (CPU target 80%) and a `PodDisruptionBudget` (`minAvailable: 1`). At the default 1/1 it sets neither and leaves `spec.replicas` unmanaged.

The controller watches every namespace by default. Multi-tenancy is enforced per `StageSet` through impersonation (see below). You can additionally scope the controller to a namespace set with `controller.watchNamespaces` — one controller instance per tenant-group — and run it under `cluster-admin` for single-tenant clusters; both are covered in [multi-cluster and tenancy](/usage/multi-cluster/).

## Hardening

Each option below is shown as the Helm values that configure it. Several are already the chart's defaults, shown so you can see what is applied and override it for a stricter policy.

### Tenant impersonation

The controller never applies your manifests with its own identity. Every cluster operation for a `StageSet` — building, applying, pruning, running actions — is performed impersonating the `StageSet`'s `spec.serviceAccountName` (the chart grants the controller `impersonate`, not write access). A `StageSet` can only do what its tenant SA permits; an over-broad or missing SA fails closed.
This one lives on the `StageSet`, not in the chart — give every production `StageSet` a scoped `ServiceAccount`:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata: { name: payments, namespace: payments }
spec:
  serviceAccountName: payments-deployer  # scoped to exactly this release's needs
  # …
```

### Pod security context

The chart runs a non-root, read-only-root-filesystem pod with all capabilities dropped, on a `gcr.io/distroless/static:nonroot` image (no shell or package manager). These are the rendered defaults:

```yaml
podSecurityContext:
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  runAsGroup: 65532
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: [ALL]
  seccompProfile:
    type: RuntimeDefault
```

### Resource limits

Requests equal limits, so the pod is fully constrained:

```yaml
resources:
  cpu: 50m
  memory: 256Mi
  ephemeralStorage: 32Mi  # /tmp and the self-signed cert dir are emptyDirs
```

### Pod-Security Standards namespace

Have the chart create the install namespace with restricted PSS labels:

```yaml
namespace:
  create: true
  pssLevel: restricted  # or: baseline / privileged
```

### Network policy

The gate endpoint is **unauthenticated** (read-only `GET /gate/{namespace}/{stageset}/{stage}`). Turn on the ingress-only NetworkPolicy to fence it — and the webhook/metrics ports — to only the peers that need them:

```yaml
networkPolicy:
  enabled: true  # admits the webhook (9443), metrics (8080), gate (8082)
```

The policy is **ingress-only**, so it does not restrict egress — the controller can still fetch stage artifacts over HTTP from source-controller (an `ExternalArtifact` or a `GitRepository`/`OCIRepository`/`Bucket` is served from the same artifact endpoint). If your cluster default-denies egress, add an egress allowance to source-controller (and DNS) so those fetches succeed.
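
For clusters that default-deny egress, the allowance described above could look roughly like the following. This is a sketch under stated assumptions, not something the chart renders: the policy name, the controller pod label, the `flux-system` and `kube-system` namespace names, the `app: source-controller` and `k8s-app: kube-dns` pod labels, and port `9090` are all assumptions to verify against your own cluster.

```yaml
# Illustrative only: labels, namespaces, and ports are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: stageset-controller-egress              # hypothetical name
  namespace: stageset-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: stageset-controller   # assumed pod label
  policyTypes: [Egress]
  egress:
    # artifact fetches from source-controller
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: flux-system
          podSelector:
            matchLabels:
              app: source-controller                # assumed pod label
      ports:
        - protocol: TCP
          port: 9090                                # assumed artifact-server pod port
    # cluster DNS
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns                     # assumed DNS pod label
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

NetworkPolicies are additive, so this only adds allowances on top of whatever else selects the controller pods. The controller also needs to reach the Kubernetes API server, and the S3 endpoint if you use the S3 rollback store; those rules are left out here because they depend on how your cluster exposes them.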

### Admission webhook TLS

`webhook.certMode` chooses how the webhook serving certificate is obtained:

```yaml
webhook:
  certMode: cert-manager   # cert-manager issues + rotates the cert (requires cert-manager)
  # certMode: self-signed  # chart default: in-pod CA + serving cert, rotated at
                           # validity/3, with no cert-manager dependency
```

## Reference setups

Two HA shapes — on-prem with shared RWX storage, and AWS/EKS with S3 — over the same backbone: a leader-elected pair (or trio), a rollback store reachable from whichever pod holds the lease, cert-manager for the webhook, a `NetworkPolicy` fencing the unauthenticated gate, and a `ServiceMonitor` if you run Prometheus.

Both run two replicas for [HA](#high-availability) (`replicas.max` above `replicas.min` also renders a PDB and an HPA) and set `webhook.certMode: cert-manager`, so [cert-manager](https://cert-manager.io/) must be installed in the cluster.

### On-prem (RWX storage)

The rollback store gives bit-exact rollbacks that outlive producer GC. With HA replicas it must be reachable from whichever pod holds the lease, so use a `ReadWriteMany` PVC on your on-prem storage class — every replica mounts the same volume.

```yaml
# values-onprem.yaml
replicas:
  min: 2  # leader-elected HA; the non-leader still serves admission
  max: 3  # > min renders an HPA (CPU 80%) and a PodDisruptionBudget
controller:
  leaderElect: true
rollbackStore:
  backend: pvc
  pvc:
    accessModes: [ReadWriteMany]
    storageClass: nfs-client  # your RWX class (NFS, CephFS, …)
    size: 10Gi
webhook:
  certMode: cert-manager  # requires cert-manager in the cluster
networkPolicy:
  enabled: true  # fences the unauthenticated gate endpoint
metrics:
  serviceMonitor:
    enabled: true
```

```shell
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace \
  -f values-onprem.yaml
```

### AWS / EKS (S3)

On EKS, back the rollback store with S3 and let the controller assume an IAM role through [IRSA](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) — no static keys. Annotate the controller's ServiceAccount with the role ARN and leave the S3 credentials empty; the store's minio-go client picks the role up from the pod's web-identity token.

```yaml
# values-eks.yaml
replicas:
  min: 2
  max: 3
controller:
  leaderElect: true
serviceAccount:
  annotations:
    # an IAM role granting s3:GetObject/PutObject/ListBucket/DeleteObject on the bucket
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/stageset-controller
rollbackStore:
  backend: s3
  s3:
    endpoint: s3.eu-west-1.amazonaws.com
    bucket: my-org-stageset-rollback
    region: eu-west-1
    # no existingSecret → credentials come from the IRSA role above
webhook:
  certMode: cert-manager
networkPolicy:
  enabled: true
metrics:
  serviceMonitor:
    enabled: true
```

```shell
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace \
  -f values-eks.yaml
```

### Alongside the other Flux controllers

`stageset-controller` is a [Flux](https://fluxcd.io/) citizen and needs no special wiring to coexist with `source-controller`, `kustomize-controller`, `helm-controller`, and `notification-controller`. It reads `ExternalArtifact` (and the standard `GitRepository`, `OCIRepository`, and `Bucket` sources) from `source-controller`, and `notification-controller` routes its events through an `Alert` that targets `kind: StageSet` — no Provider/Alert plumbing of its own. Install it in its own namespace (e.g. `stageset-system`) next to `flux-system`; the only cluster-scoped pieces are its CRDs, `ClusterRole`, and webhook configuration.

### Alongside JaaS

[JaaS](https://jaas.projects.metio.wtf/) renders Jsonnet and publishes the result as an `ExternalArtifact`, which is what a `StageSet` stage consumes — so the two compose directly. Reference the artifact by name, or name the producing `JsonnetSnippet` and let `stageset-controller` resolve it (see [producer-aware sources](/usage/producer-aware-sources/)). They can share a cluster and namespace or stay separate; both are reconciled by Flux and both apply under per-tenant impersonation, so the security model is consistent end to end.
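
For the `notification-controller` routing mentioned under "Alongside the other Flux controllers", a Flux `Provider` and `Alert` pair might look like the sketch below. It assumes, as that section states, that an `Alert` can name `kind: StageSet` as an event source; Flux's `Alert` CRD may validate `eventSources[].kind` against a fixed list in some versions, so confirm that your installed version admits it. The resource names, the Slack channel, and the Secret name are hypothetical.

```yaml
# Illustrative only: names, channel, and Secret are made up.
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack-releases            # hypothetical
  namespace: payments
spec:
  type: slack
  channel: releases               # hypothetical channel
  secretRef:
    name: slack-webhook           # Secret holding the webhook URL under `address`
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: stageset-events           # hypothetical
  namespace: payments
spec:
  providerRef:
    name: slack-releases
  eventSeverity: info             # info also forwards error-severity events
  eventSources:
    - kind: StageSet              # assumed to be accepted, per the text above
      name: "*"                   # every StageSet in this namespace
```

Setting `eventSeverity: error` instead would limit notifications to failure events.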

## Settings you can tune

The chart wires the controller; you set Helm values. The set worth thinking about is below — each row is the value, its default, and when you'd change it. Everything else the chart configures for you (see [what the chart manages](#what-the-chart-manages)).

| Helm value | Default | When to change |
|---|---|---|
| `replicas.min` / `replicas.max` | `1` / `1` | Raise both to ≥ 2 for HA; set `max > min` to also render an HPA + PDB. |
| `controller.leaderElect` | `true` | Leave on — harmless at one replica, required for HA. |
| `controller.defaultInterval` | `10m` | The reconcile cadence StageSets inherit when they omit `spec.interval`. Lower for faster drift correction cluster-wide. |
| `controller.inventoryMode` | `hybrid` | `applyset` for ApplySet-native tooling; `entries` to drop the ApplySet labels. |
| `controller.inventoryShardCap` | `5000` | Lower only if a stage applies a huge object count and you want smaller inventory objects. |
| `controller.allowedActionHosts` | `[]` | Add host globs your `http` [actions](/usage/actions/) must reach (loopback/link-local are always denied). |
| `controller.noCrossNamespaceRefs` | `false` | `true` to hard-isolate namespaces (deny cross-namespace `sourceRef`/`dependsOn`). |
| `controller.watchNamespaces` | `[]` | Restrict the controller to a namespace list (cache + RBAC pivot to per-namespace bindings); empty watches cluster-wide. See [tenancy](/usage/multi-cluster/#scoping-the-controller-to-a-namespace-set). |
| `rbac.clusterAdmin` | `false` | `true` on **single-tenant** clusters to bind the controller SA to `cluster-admin` so StageSets apply without `serviceAccountName`. See [single-tenant](/usage/multi-cluster/#single-tenant-cluster-admin). |
| `controller.runbookBaseURL` | the docs site | Point at a fork/mirror, or empty to drop the runbook links from Ready messages. |
| `webhook.certMode` | `self-signed` | `cert-manager` if you run cert-manager — see [reference setups](#reference-setups). |
| `gate.enabled` | `true` | Leave on for [progressive delivery](/tutorials/progressive-delivery/) (the Flagger/Argo gate); set `false` to drop the gate Service and endpoint. |
| `rollbackStore.backend` | `none` | `pvc` (RWX) or `s3` to enable [`spec.rollbackOnFailure`](/usage/rollback/); the two are mutually exclusive. |
| `rollbackStore.s3.sse` | `s3` | At-rest encryption for the S3 store (it holds rendered Secret data): `s3` (SSE-S3), `kms` (+`sseKmsKeyId`), or `none`. See [encryption at rest](/usage/rollback/#encryption-at-rest). |
| `networkPolicy.enabled` | `false` | `true` to fence the controller and the unauthenticated gate. |
| `metrics.serviceMonitor.enabled` | `false` | `true` if you scrape with the Prometheus operator. |
| `metrics.prometheusRule.enabled` | `false` | `true` for the bundled [alerts](/installation/operations/#alerts). |
| `serviceAccount.annotations` | `{}` | An IRSA role ARN on EKS so the S3 store uses an IAM role. |
| `namespace.create` | `false` | `true` to have the chart create the install namespace with Pod-Security labels. |
| `resources` | requests = limits | Raise for very large or very busy releases. |

Every option is set the same way — in your values file, applied with `helm upgrade --install … -f values.yaml`. The [reference setups](#reference-setups) above are complete, copy-pasteable examples.

## What the chart manages

You do **not** configure these — the chart wires them so the controller behaves correctly out of the box:

- **Leader election and HA plumbing** — the lease, and the PDB/HPA when `replicas.max > replicas.min`.
- **The admission webhook** — the server, its Service, the `ValidatingWebhookConfiguration`, and the certificate (cert-manager `Certificate` or the in-pod self-signed renewer, per `webhook.certMode`).
- **Endpoints** — metrics, health probes, and the gate, on their Services.
- **RBAC** — the ClusterRole/bindings the controller needs, including the `impersonate` verb (it never applies as itself).
- **A hardened pod** — non-root, read-only root filesystem, dropped capabilities, seccomp `RuntimeDefault` (see [pod security context](#pod-security-context)).
- **Per-tenant impersonation** — every apply runs as the StageSet's `spec.serviceAccountName`.

## Controller flags

The chart sets the controller's command-line flags from your Helm values and its own defaults — you never pass them directly. For the exhaustive per-flag list with defaults, see the [Configuration reference](/installation/configuration/), which also notes which Helm value drives each one.

---

# Tutorials

Source: https://stageset.projects.metio.wtf/tutorials/

End-to-end walkthroughs that stitch several pieces together. Where the [usage](/usage/) pages each cover one feature in isolation, these follow a whole task from start to finish.

---

# From Jsonnet to a gated rollout

Source: https://stageset.projects.metio.wtf/tutorials/jsonnet-to-rollout/

This tutorial follows a complete delivery: write [Kubernetes](https://kubernetes.io/docs/) manifests in [Jsonnet](https://jsonnet.org/) and publish the source through [Flux](https://fluxcd.io/); [JaaS](https://jaas.projects.metio.wtf/) renders it into a Flux `ExternalArtifact`, and a StageSet rolls it out with a readiness gate. The chain is:

```text
Jsonnet in Git/OCI/Bucket → JaaS (JsonnetSnippet) → ExternalArtifact → StageSet
```

This tutorial renders *Jsonnet*, so it goes through JaaS: JaaS turns the Jsonnet into an `ExternalArtifact` the stage consumes. (If your manifests were already plain YAML, a stage could read a `GitRepository`/`OCIRepository`/`Bucket` directly — see [Stage sources](/tutorials/flux-sources/). The renderer is here because the input is Jsonnet, not because StageSet can't read Git.)

## Prerequisites

- Flux installed (with the `ExternalArtifact` API — Flux ≥ v2.7.0).
- [JaaS](https://jaas.projects.metio.wtf/) installed in operator mode.
- StageSet installed (see [Installation](/installation/kubernetes/)).
- An `apps` namespace, and a `web-deployer` `ServiceAccount` in it whose RBAC can apply the workload (the StageSet impersonates it):

  ```shell
  kubectl create namespace apps
  kubectl -n apps create serviceaccount web-deployer
  # bind web-deployer to a Role/ClusterRole that can manage Deployments and
  # Services in the apps namespace — see /usage/multi-cluster/ for the tenancy model
  ```

## 1. Write the manifests in Jsonnet

A small web app, parameterized as a Jsonnet top-level function so the same source renders for any environment. Commit this as `jsonnet/main.jsonnet` in a Git repo:

```jsonnet
// jsonnet/main.jsonnet
function(name='web', image='registry.internal/web:latest', replicas='2') {
  apiVersion: 'v1',
  kind: 'List',
  items: [
    {
      apiVersion: 'apps/v1',
      kind: 'Deployment',
      metadata: { name: name },
      spec: {
        replicas: std.parseInt(replicas),
        selector: { matchLabels: { app: name } },
        template: {
          metadata: { labels: { app: name } },
          spec: { containers: [{ name: name, image: image }] },
        },
      },
    },
    {
      apiVersion: 'v1',
      kind: 'Service',
      metadata: { name: name },
      spec: { selector: { app: name }, ports: [{ port: 80, targetPort: 8080 }] },
    },
  ],
}
```

Rendering a `kind: List` keeps several resources in one document — both the kustomize build the controller runs and `kubectl` flatten it transparently.

## 2. Publish the source through Flux

Point a Flux `GitRepository` at the repo so the cluster has the Jsonnet:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: web-manifests
  namespace: apps
spec:
  interval: 1m
  url: https://github.com/acme/web-manifests
  ref:
    branch: main
```

Apply it and wait for the source to sync:

```shell
kubectl apply -f gitrepository.yaml
kubectl -n apps wait --for=condition=Ready gitrepository/web-manifests
```

## 3.
Render with JaaS A `JsonnetSnippet` reads the Jsonnet from that source, passes the parameters as top-level arguments, and publishes the rendered result as an `ExternalArtifact`: ```yaml apiVersion: jaas.metio.wtf/v1 kind: JsonnetSnippet metadata: name: web namespace: apps spec: sourceRef: kind: GitRepository name: web-manifests path: ./jsonnet entryFile: main.jsonnet tlas: # top-level args → the function() parameters name: ["web"] image: ["registry.internal/web:2.1.0"] replicas: ["3"] ``` Apply it; JaaS then publishes an `ExternalArtifact` named `web` in the `apps` namespace. Confirm it went Ready: ```shell kubectl apply -f jsonnetsnippet.yaml kubectl -n apps get externalartifact web ``` ## 4. Roll it out with StageSet Reference the `JsonnetSnippet` as the stage source — StageSet resolves the producer to its `ExternalArtifact` — and gate the stage on the Deployment becoming available: ```yaml apiVersion: stages.metio.wtf/v1 kind: StageSet metadata: name: web namespace: apps spec: serviceAccountName: web-deployer # applies are impersonated as this SA stages: - name: web sourceRef: apiVersion: jaas.metio.wtf/v1 kind: JsonnetSnippet name: web readyChecks: checks: - apiVersion: apps/v1 kind: Deployment name: web ``` Apply it, preview the change before it lands, then watch it roll out: ```shell kubectl apply -f stageset.yaml stagesetctl diff web -n apps # preview against live cluster state stagesetctl get web -n apps # per-stage progress ``` ## 5. Ship a change Edit `jsonnet/main.jsonnet` (or bump the `image` TLA on the snippet) and commit. Flux pulls the new commit, JaaS re-renders and republishes the `ExternalArtifact`, and StageSet — watching the producer — reconciles the new revision through the same gate. No StageSet edit required. ### No labels or annotations needed You do **not** annotate or label anything to make this chain fire. 
The linkage is the `sourceRef` itself: the controller watches the source *kinds* (`ExternalArtifact`, `GitRepository`, `OCIRepository`, `Bucket`, and producers like `JsonnetSnippet`) and, when one changes, maps it back to every StageSet whose `sourceRef` points at it — then reconciles those. JaaS works the same way for a snippet's own `sourceRef` and library references. Discovery is automatic; you only declare the references. ## Versioning the rollout To gate one-time [migrations](/usage/versioned-migrations/) on a release boundary, declare the version. The simplest is to pin it on the StageSet, bumped alongside the image: ```yaml spec: version: value: "2.1.0" migrations: - name: backfill-2-0 to: "2.0.0" # runs once when the deployed version crosses 2.0.0 stage: web actions: - name: backfill job: sourceRef: name: web-migrations stages: - name: web sourceRef: apiVersion: jaas.metio.wtf/v1 kind: JsonnetSnippet name: web ``` ### Let the version travel with the rendered manifests Pinning works, but the cleaner pattern is to let the version ride *inside* the manifests the snippet renders — so a single value flows from your CI all the way to the rollout gate. 
Feed the version into the snippet and stamp it onto the standard `app.kubernetes.io/version` label (and the image tag, from the same value): ```jsonnet // web.jsonnet local version = std.extVar('version'); // supplied by JaaS extVars / your CI { apiVersion: 'apps/v1', kind: 'Deployment', metadata: { name: 'web', labels: { 'app.kubernetes.io/version': version }, // ← the version, in the manifest }, spec: { template: { metadata: { labels: { 'app.kubernetes.io/version': version } }, spec: { containers: [{ name: 'web', image: 'registry.example/web:' + version }] }, }, }, } ``` Then point `version.fromObject` at that object and drop the inline `value` — the controller reads the label off the rendered `Deployment`: ```yaml spec: version: fromObject: stage: web kind: Deployment name: web # defaults to the app.kubernetes.io/version label migrations: - name: backfill-2-0 to: "2.0.0" stage: web actions: - name: backfill job: sourceRef: name: web-migrations stages: - name: web sourceRef: apiVersion: jaas.metio.wtf/v1 kind: JsonnetSnippet name: web ``` Now the version has exactly one source of truth — the value your pipeline feeds the snippet — and it shows up in the image tag, the version label, *and* the migration gate together. The same `fromObject` works for a `GitRepository`/`OCIRepository` source too; only a source that ships a dedicated file wants [`version.fromArtifact`](/usage/versioned-migrations/#from-a-file-in-the-artifact--versionfromartifact) instead. See [versioned migrations](/usage/versioned-migrations/) for all three. ## Next From here, add more [stages](/usage/stages-and-sources/), pre/post [actions](/usage/actions/), or [update windows](/usage/update-windows/) to turn this single rollout into a gated, multi-stage release. To parameterize per environment, see [Parameters](/tutorials/parameters/). 
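
As a rough illustration of the boundary check behind a `to: "2.0.0"` migration, the sketch below compares versions in plain shell. It assumes a migration fires when the previously deployed version is below `to` and the new version is at or above it. That reading of "crosses" is an assumption, not the controller's documented algorithm, and `version_le` and `crosses` are hypothetical helper names. The sketch relies on `sort -V` (GNU coreutils) and ignores semver pre-release ordering.

```shell
# version_le A B: succeeds when A <= B under version ordering (sort -V).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# crosses PREVIOUS TARGET CURRENT: succeeds when PREVIOUS < TARGET <= CURRENT.
# Assumed semantics for illustration only.
crosses() {
  [ "$1" != "$2" ] && version_le "$1" "$2" && version_le "$2" "$3"
}

# Upgrading 1.9.3 -> 2.1.0 passes the 2.0.0 boundary.
crosses 1.9.3 2.0.0 2.1.0 && echo "backfill-2-0 would run"

# Upgrading 2.0.0 -> 2.1.0 started at the boundary, so nothing is crossed.
crosses 2.0.0 2.0.0 2.1.0 || echo "already at 2.0.0, nothing to do"
```

A check like this can help when reasoning about which migrations a planned version bump would trigger, before changing `version.value` or the label that `version.fromObject` reads.
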

---

# Parameterizing a rollout

Source: https://stageset.projects.metio.wtf/tutorials/parameters/

A rollout takes parameters at two distinct layers, which serve different purposes:

- **Render-time parameters (JaaS).** Change *what gets rendered*. The Jsonnet computes its output from top-level arguments (`tlas`) and external variables (`externalVariables`). Different values produce a different `ExternalArtifact`.
- **Delivery-time parameters (StageSet `postBuild`).** Inject values *into already-rendered manifests*, per stage, by string substitution — the same mechanism Flux's `kustomize-controller` uses.

Use render-time parameters for structural logic; use delivery-time parameters to stamp environment-specific values onto a shared artifact.

## Render-time: JaaS TLAs and external variables

Top-level arguments map to a Jsonnet `function(...)`:

```jsonnet
// main.jsonnet
function(name='web', replicas='2') {
  apiVersion: 'apps/v1',
  kind: 'Deployment',
  metadata: { name: name },
  spec: { replicas: std.parseInt(replicas) /* … */ }
}
```

```yaml
apiVersion: jaas.metio.wtf/v1
kind: JsonnetSnippet
metadata:
  name: web
  namespace: apps
spec:
  sourceRef: { kind: GitRepository, name: web-manifests, path: ./jsonnet }
  tlas:                  # → function(name, replicas)
    name: ["web"]
    replicas: ["3"]
  externalVariables:     # → std.extVar('environment')
    environment: "production"
```

`tlas` is a map of name → list of values (a single-element list for a scalar argument; multiple values become a JSON array). `externalVariables` are plain strings read with `std.extVar`.
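
To see what the render-time layer produces before JaaS is involved, the same parameters can be passed to the `jsonnet` CLI locally. This is only a sketch: it assumes the `jsonnet` binary is installed, that the source lives at `jsonnet/main.jsonnet`, and that scalar TLAs are passed as strings (which matches the `std.parseInt(replicas)` call above). `tla_flags` and `preview` are hypothetical helper names, and nothing here claims to reproduce exactly how JaaS evaluates a snippet.

```shell
# tla_flags NAME=VALUE...: print one jsonnet CLI argument per line,
# turning each pair into a --tla-str flag (string-valued top-level argument).
tla_flags() {
  for pair in "$@"; do
    printf '%s\n' "--tla-str" "$pair"
  done
}

# preview NAME=VALUE...: render jsonnet/main.jsonnet locally when the CLI exists.
# Values containing whitespace are not supported by this simple word splitting.
preview() {
  if command -v jsonnet >/dev/null 2>&1; then
    # Word splitting of the flag list is intended here.
    jsonnet $(tla_flags "$@") --ext-str environment=production jsonnet/main.jsonnet
  else
    echo "jsonnet CLI not found; skipping preview" >&2
  fi
}
```

Running `preview name=web replicas=3` would then print the rendered JSON for the same arguments the `JsonnetSnippet` above declares, which makes it easier to spot a structural mistake before the artifact is published.
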
## Delivery-time: StageSet postBuild substitution

When the rendered manifests carry `${var}` placeholders, a stage substitutes them at apply time — from inline values, ConfigMaps, and Secrets:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: web
  namespace: apps
spec:
  stages:
    - name: web
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: web
      postBuild:
        substitute:
          cluster_name: prod-eu
        substituteFrom:
          - kind: ConfigMap
            name: cluster-vars
          - kind: Secret
            name: cluster-secrets
            optional: true
```

A manifest field like `value: "${cluster_name}"` becomes `value: "prod-eu"` for this stage.

## Reusing one artifact across environments

The two layers combine into a common pattern: render an environment-*agnostic* artifact once with JaaS, then have several StageSets — one per environment — consume that same artifact and stamp their own values with `postBuild`:

```yaml
# staging
spec:
  stages:
    - name: web
      sourceRef: { apiVersion: jaas.metio.wtf/v1, kind: JsonnetSnippet, name: web }
      postBuild:
        substituteFrom:
          - { kind: ConfigMap, name: staging-vars }
---
# production (same artifact, different values)
spec:
  stages:
    - name: web
      sourceRef: { apiVersion: jaas.metio.wtf/v1, kind: JsonnetSnippet, name: web }
      postBuild:
        substituteFrom:
          - { kind: ConfigMap, name: production-vars }
```

One render, many environments — each StageSet bounded by its own [ServiceAccount](/usage/multi-cluster/) and gated by its own [actions](/usage/actions/) and [update windows](/usage/update-windows/).

---

# Progressive delivery

Source: https://stageset.projects.metio.wtf/tutorials/progressive-delivery/

`StageSet` integrates with both progressive-delivery controllers: [Flagger](https://flagger.app/) and [Argo Rollouts](https://argoproj.github.io/argo-rollouts/). The controller exposes a read-only gate endpoint and a readiness gauge so either one can hold a promotion until a `StageSet` stage is healthy; ready checks let a stage wait on a Rollout in return.
Pick the section for your controller below — see also [StageSet vs Argo Rollouts](/comparisons/argo-rollouts/).

## The gate contract

The gate endpoint backs the Flagger integration and the Argo Rollouts JSON-metric option.

```text
GET /gate/{namespace}/{stageset}/{stage}

200 — the stage is Ready at the currently pinned revision
403 — the stage is not Ready (or not found / not gateable)
```

It is served on `--gate-bind-address` (default `:8082`) and exposed by the chart's `stageset-controller-gate` Service (`gate.enabled`, on by default). The endpoint is **unauthenticated and read-only**, so fence it with a `NetworkPolicy` ([production](/installation/production/#network-policy)) to admit only your delivery controller.

## Flagger

Add a `confirm-promotion` (or `confirm-rollout`) webhook to a Flagger `Canary` pointing at the gate. Flagger blocks the promotion until the gate returns `200`:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web
  namespace: apps
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  analysis:
    interval: 1m
    threshold: 5
    stepWeight: 10
    maxWeight: 50
    webhooks:
      - name: stageset-stage-ready
        type: confirm-promotion
        # gate this canary's promotion on a StageSet stage being Ready
        url: http://stageset-controller-gate.stageset-system:8082/gate/apps/web/web
```

This is independent of the Flagger *strategy*: the same webhook gates a weighted **canary**, an **A/B test** (header/cookie routing), or a **blue-green** promotion — the gate only answers "is this stage Ready," and Flagger decides what to do with that answer.

This coordinates two moving parts: Flagger shifts traffic to a new version only once a StageSet stage that applied the supporting config (a CRD, a migration, a sibling component) reports Ready.

## Argo Rollouts

Argo Rollouts gates on **analysis metrics** (a query that returns a value to compare) rather than a webhook's HTTP status, so the controller meets it on its own terms in two ways.
### Gate on the readiness gauge (recommended)

The controller exports `stageset_stage_ready{namespace,stageset,stage}` (`1` when the stage is Ready, `0` otherwise). Argo's **Prometheus** metric provider gates on it directly — no gate endpoint, no Job:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: stageset-stage-ready
  namespace: apps
spec:
  metrics:
    - name: stage-ready
      # the query returns a vector, so compare its first element
      successCondition: result[0] == 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: max(stageset_stage_ready{namespace="apps",stageset="web",stage="web"})
```

### Gate on the JSON endpoint

The same gate endpoint also answers JSON when asked (`Accept: application/json`), returning `{"ready": true, …}` with a `200` so Argo's **web** metric can parse it (Argo treats a non-2xx as an error, so readiness has to live in the body):

```yaml
spec:
  metrics:
    - name: stage-ready
      successCondition: "result.ready == true"
      provider:
        web:
          url: http://stageset-controller-gate.stageset-system:8082/gate/apps/web/web
          headers:
            - key: Accept
              value: application/json
          jsonPath: "{$}"
```

A **Job-based metric** (`curl -fsS …` against the gate, succeeding only on `200`) is the fallback when the analysis has no Prometheus or web access.

## The reverse direction: gate a StageSet on a Rollout

The coordination also works the other way. Because [ready checks](/usage/ready-checks/) accept CEL, a StageSet stage can wait on an Argo `Rollout` finishing its own progressive rollout before the next stage runs:

```yaml
readyChecks:
  exprs:
    - apiVersion: argoproj.io/v1alpha1
      kind: Rollout
      current: "status.phase == 'Healthy'"
      inProgress: "status.phase in ['Progressing', 'Paused']"
      failed: "status.phase == 'Degraded'"
```

So StageSet can gate Argo (via the gauge/gate) and Argo's outcome can gate StageSet (via ready checks) — pick whichever direction your release needs.
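
When wiring either controller to the gate, it can help to probe the endpoint by hand first. The sketch below follows the gate contract described on this page (`200` means Ready, `403` means not Ready) and uses the plain, non-JSON mode. The helper names `gate_url`, `gate_verdict`, and `probe_gate` are hypothetical, and the probe assumes `curl` is available and that the caller can reach the gate Service, for example from a pod admitted by the `NetworkPolicy`.

```shell
# gate_url BASE NAMESPACE STAGESET STAGE: build the gate URL from the contract.
gate_url() {
  printf '%s/gate/%s/%s/%s\n' "$1" "$2" "$3" "$4"
}

# gate_verdict STATUS: map an HTTP status code from the gate to a verdict.
gate_verdict() {
  case "$1" in
    200) echo ready ;;
    403) echo not-ready ;;
    *)   echo unknown ;;
  esac
}

# probe_gate URL: query a live gate and print the verdict.
# A connection failure is reported as "unknown" rather than "not-ready".
probe_gate() {
  status=$(curl -s -o /dev/null -w '%{http_code}' "$1") || status=000
  gate_verdict "$status"
}
```

For the examples on this page, the call would be `probe_gate "$(gate_url http://stageset-controller-gate.stageset-system:8082 apps web web)"`. Keeping "unknown" separate from "not-ready" is a deliberate choice in this sketch, so a blocked network path is not mistaken for an unhealthy stage.
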
--- # Quickstart Source: https://stageset.projects.metio.wtf/tutorials/quickstart/ This tutorial takes you from an empty cluster to one running StageSet. The path is the shortest one — a single stage pointing directly at a Flux `GitRepository` that already holds plain manifests. No Jsonnet, no migrations, no optional knobs. ## Prerequisites - A [Kubernetes](https://kubernetes.io/docs/) cluster with `kubectl` configured against it. - `helm` 3.x. - [Flux](https://fluxcd.io/) **v2.7.0 or newer** — the `ExternalArtifact` CRD a stage resolves to lands in that version. See [Install on Kubernetes](/installation/kubernetes/#prerequisites) for the full prerequisites. ## Step 1 — Install the controller ```shell helm upgrade --install stageset-controller \ oci://ghcr.io/metio/helm-charts/stageset-controller \ --namespace stageset-system --create-namespace \ --wait --timeout 5m ``` See [Install on Kubernetes](/installation/kubernetes/) for the full list of chart values (HA replicas, rollback store, webhook TLS mode, and so on). Verify the controller is running: ```shell kubectl -n stageset-system get deploy stageset-controller # NAME READY UP-TO-DATE AVAILABLE AGE # stageset-controller 1/1 1 1 30s ``` ## Step 2 — Provide a source A stage reads from a Flux source. The quickest path is a `GitRepository` pointing at a repo that contains plain Kubernetes manifests: ```shell cat <= 3" timeout: 5m ``` ### `patch` Patch an existing object — flip a feature flag, scale something, annotate. `type` is `merge` (default) for a strategic-merge patch, or `json6902` for a JSON Patch: ```yaml - name: enable-traffic patch: target: apiVersion: v1 kind: Service name: web type: merge # default; or json6902 patch: | { "spec": { "selector": { "release": "green" } } } ``` ### `delete` Remove an existing object; a missing object counts as success. 
```yaml - name: drop-old-job delete: target: apiVersion: batch/v1 kind: Job name: legacy-migration ``` ### `apply` Apply transient, rollout-scoped manifests that are **not** inventory-tracked and are never pruned — a maintenance page, a one-shot canary, a temporary config. With `wait: true` the action blocks until the applied objects report Ready (kstatus), bounded by the action `timeout`, so a following `patch` can repoint traffic only once the resource is serving. Because the applied objects are never pruned by the inventory diff, stand a resource up only for the duration of a rollout by pairing an `apply` in `pre` with a matching `delete` in `post`, and guard a mid-run crash with an `onFailure` delete: ```yaml actions: pre: - name: stand-up-maintenance-page apply: sourceRef: name: maintenance-page # an ExternalArtifact holding a Pod + Service wait: true # block until it is serving post: - name: tear-down-maintenance-page delete: target: apiVersion: v1 kind: Pod name: maintenance-page onFailure: - name: tear-down-maintenance-page-on-failure delete: target: apiVersion: v1 kind: Pod name: maintenance-page ``` The action ledger gates each step per pinned revision, so a retry or controller restart never re-applies or re-deletes the resource for the same snapshot. To run a `job` action only when the deployed version crosses a release boundary, see [versioned migrations](/usage/versioned-migrations/). --- # Conflict policies Source: https://stageset.projects.metio.wtf/usage/conflict-policies/ Conflict policies decide what happens when an apply hits an immutable-field conflict — a changed `clusterIP`, a `Job` pod template, a `StorageClass` field that can't be updated in place. By default the controller fails the stage and reports it, so nothing destructive happens by surprise. A policy opts specific resources into automatic resolution. ## The three actions - `Fail` — stop and report (the default; safest). 
- `Recreate` — delete and re-create the object to get past an immutable-field change. - `KeepExisting` — leave the live object as-is and move on. ## A default for the whole stage ```yaml spec: stages: - name: app sourceRef: name: my-app conflictPolicy: default: Fail # explicit; the safe default ``` The `force: true` shorthand on a stage is equivalent to `conflictPolicy.default: Recreate`. ## Per-resource rules Rules recreate exactly the resources that need it while everything else stays `Fail`. A rule's `target` is a partial selector — any field you omit matches everything. Rules are evaluated in list order; the **first** rule whose target matches wins, and an object matching no rule falls back to `default`. ```yaml conflictPolicy: default: Fail rules: # a Job's pod template is immutable — recreate it on change - target: apiVersion: batch/v1 kind: Job action: Recreate # never fight an HPA over replica counts - target: kind: Deployment name: web action: KeepExisting ``` ## Recreating storage Recreating a `PersistentVolumeClaim` or `PersistentVolume` destroys data, so a `Recreate` **rule** targeting one is refused unless you explicitly accept the loss: ```yaml rules: - target: kind: PersistentVolumeClaim name: scratch action: Recreate allowDataLoss: true # required for PVC/PV Recreate, refused otherwise ``` Without `allowDataLoss: true`, a `Recreate` rule targeting a PVC/PV is rejected — a guardrail against accidentally wiping a volume. --- # Multi-cluster and tenancy Source: https://stageset.projects.metio.wtf/usage/multi-cluster/ There are two ways to run the controller, and they map onto two different trust models. Pick the one that matches your cluster: - **Multi-tenant** — the controller holds no write access of its own and applies every `StageSet` impersonating that `StageSet`'s `serviceAccountName`. Each tenant's RBAC bounds what its releases can touch. This is the chart default. 
- **Single-tenant** — the cluster has one operator, so per-tenant isolation buys nothing. Run the controller under its own identity bound to `cluster-admin` and skip impersonation entirely — the model Flux's `helm-controller` uses in its default install. The two sections below set each one up. The optional [watch scoping](#scoping-the-controller-to-a-namespace-set) narrows *which* namespaces a multi-tenant controller sees. ## Impersonation (multi-tenant) The controller never applies your manifests as itself. Set `serviceAccountName` and every operation for that `StageSet` — build, apply, prune, actions — is performed impersonating that ServiceAccount. The `StageSet` can do exactly what the SA's RBAC permits, and nothing more. ```yaml spec: serviceAccountName: payments-deployer # all writes impersonate this SA stages: - name: app sourceRef: name: payments-app ``` Grant the SA only the rights that release needs: ```yaml apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: payments-deployer namespace: payments roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: edit subjects: - kind: ServiceAccount name: payments-deployer namespace: payments ``` This is the multi-tenancy model: isolation comes from each `StageSet` being bounded by its tenant SA, not from the controller's own grant — by default the chart gives the controller `impersonate` and read access, no blanket write. A `StageSet` with no `serviceAccountName`, or one bound to a too-narrow SA, fails closed rather than escalating. ## Single-tenant cluster-admin On a cluster with a single operator, per-`StageSet` impersonation is friction with no payoff — there is no other tenant to isolate from. Run the controller the way Flux's `helm-controller` runs by default: under its own ServiceAccount, bound to the built-in `cluster-admin` ClusterRole. `StageSet`s then omit `serviceAccountName` and apply as the controller, which can write any kind cluster-wide. 
Turn it on with one Helm value: ```yaml rbac: clusterAdmin: true # bind the controller SA to cluster-admin ``` ```bash helm upgrade --install stageset-controller \ oci://ghcr.io/metio/helm-charts/stageset-controller \ -n stageset-system --create-namespace \ --set rbac.clusterAdmin=true ``` `StageSet`s then need nothing tenancy-related — they apply directly: ```yaml apiVersion: stages.metio.wtf/v1 kind: StageSet metadata: name: platform namespace: stageset-system spec: stages: - name: app sourceRef: name: platform-app # applied by the controller's cluster-admin identity ``` When `serviceAccountName` is unset and no `kubeConfig` is given, the controller applies with its own client — so the `cluster-admin` binding is what lets those `StageSet`s write. The trade-off: every `StageSet` on the cluster has full write access, so this is for single-tenant clusters only. Leave `rbac.clusterAdmin` at its default `false` and use [impersonation](#impersonation-multi-tenant) whenever more than one team shares the cluster. The two mix — a cluster-admin controller still honors `serviceAccountName` on any `StageSet` that sets it, dropping to that SA's rights for that release. ## Scoping the controller to a namespace set By default the controller watches every namespace. To run one controller per tenant-group instead — disjoint deployments that each see only their own namespaces — set `controller.watchNamespaces`: ```yaml controller: watchNamespaces: - team-a - team-b ``` This does two things together: - **Cache scoping.** The manager's informers only observe `StageSet`s and sources in the listed namespaces. Resources elsewhere never enter the cache, so the controller cannot act on them even if RBAC would allow it. - **RBAC pivot.** The chart stops binding the tenant ClusterRole cluster-wide and instead renders one `RoleBinding` per listed namespace — defense in depth, so the apiserver also refuses out-of-scope calls. 
(The cluster-scoped webhook-caBundle grant stays a `ClusterRoleBinding`, since a `ValidatingWebhookConfiguration` is not namespaced.) Run several releases with disjoint `watchNamespaces` lists to shard the cluster across independent controller instances. Combine it with impersonation for the tightest setup: each instance sees only its namespaces, and each `StageSet` is bounded by its tenant SA. ## Remote clusters Point a `StageSet` at another cluster with `kubeConfig`, referencing a Secret that holds a kubeconfig. Combined with `serviceAccountName`, the controller applies to the remote cluster as the impersonated identity there. ```yaml spec: serviceAccountName: payments-deployer kubeConfig: secretRef: name: prod-eu-kubeconfig # key defaults to "value" (the Flux convention); set it to override stages: - name: app sourceRef: name: payments-app ``` The Secret is read with the controller's own identity — connecting to the target cluster is the controller's job — and the kubeconfig payload defaults to the `value` key. A self-contained kubeconfig is required; `configMapRef`-style cloud-provider auth is not supported. Cross-namespace `sourceRef` and `dependsOn` references can be disabled cluster-wide with the controller's `--no-cross-namespace-refs` flag when you want hard namespace isolation. --- # Producer-aware sources Source: https://stageset.projects.metio.wtf/usage/producer-aware-sources/ [Stages and sources](/usage/stages-and-sources/#source-kinds) covers the two direct routes — an `ExternalArtifact` (the default `sourceRef.kind`) or a Flux `GitRepository`/`OCIRepository`/`Bucket`. The third option names the thing that *produces* an artifact and lets the controller find it. This is useful when an operator publishes an `ExternalArtifact` from a custom resource (for example [JaaS](https://jaas.projects.metio.wtf/) rendering Jsonnet). 
## Referencing a producer Set `kind` (and `apiVersion`) to a producer resource, and the controller resolves it to the `ExternalArtifact` that producer publishes — the one whose `spec.sourceRef` back-references the producer (matched on group, kind, and name). For example, a [JaaS](https://jaas.projects.metio.wtf/) `JsonnetSnippet` renders Jsonnet and publishes an `ExternalArtifact`; reference the snippet and the controller follows the link: ```yaml spec: stages: - name: dashboards sourceRef: apiVersion: jaas.metio.wtf/v1 kind: JsonnetSnippet name: grafana-dashboards ``` The controller also watches the common Flux source kinds (`GitRepository`, `OCIRepository`, `Bucket`) so a stage re-reconciles when an upstream source changes. A producer can itself consume another producer first: a JaaS `JsonnetSnippet` can render from the artifact another snippet publishes. That chaining happens on the producer side — see [chaining snippets](https://jaas.projects.metio.wtf/usage/snippet-sources/#chaining-snippets). A stage references only the final producer and reads the `ExternalArtifact` it publishes. ## Related projects JOI, JaaS, and `StageSet` compose end to end: - **[JOI](https://github.com/metio/jsonnet-oci-images)** publishes Jsonnet libraries as single-layer OCI images (usable both as image-volume mounts and as Flux `OCIRepository` sources). - **[JaaS](https://jaas.projects.metio.wtf/)** evaluates Jsonnet — optionally importing those JOI libraries — and publishes the rendered JSON as an `ExternalArtifact`. - **`StageSet`** references the `JsonnetSnippet` (or its artifact) and rolls the result out in ordered, gated stages. Each project is independently useful; a stage reads straight from a `GitRepository`, `OCIRepository`, or `Bucket`, or from any `ExternalArtifact` regardless of what produced it. --- # Ready checks Source: https://stageset.projects.metio.wtf/usage/ready-checks/ Ready checks decide when a stage is healthy enough to let the next stage start. 
They are purely observational — the controller waits and reports, but takes no action (active steps are [actions](/usage/actions/)). By default, with no `readyChecks` block, the controller waits for **every** object the stage applied to report ready via [kstatus](https://github.com/kubernetes-sigs/cli-utils/tree/master/pkg/kstatus). `readyChecks` lets you narrow that to specific objects (`checks`), add custom health for resources kstatus doesn't understand (`exprs`, [CEL](https://github.com/google/cel-spec)), bound the wait (`timeout`), or skip it entirely (`disableWait`). `checks` and `exprs` may be set together. ## Explicit objects Wait for named objects only — useful when a stage applies many objects but only a few gate the next stage: ```yaml spec: stages: - name: infrastructure sourceRef: name: platform readyChecks: timeout: 5m checks: - apiVersion: apiextensions.k8s.io/v1 kind: CustomResourceDefinition name: ledgers.payments.example - apiVersion: apps/v1 kind: Deployment name: ledger-operator namespace: platform-system ``` ## Custom health with CEL For custom resources kstatus doesn't understand, describe readiness with CEL expressions. The shape matches `kustomize-controller`'s `healthCheckExprs`, so expressions are portable. ```yaml readyChecks: exprs: - apiVersion: db.example/v1 kind: Database current: "status.phase == 'Running'" inProgress: "status.phase in ['Pending', 'Provisioning']" failed: "status.phase == 'Failed'" ``` ## Opting out To apply a stage without waiting for readiness (fire-and-forget), disable the wait: ```yaml readyChecks: disableWait: true ``` --- # Rollback Source: https://stageset.projects.metio.wtf/usage/rollback/ When a run fails, the controller can restore the last successfully-applied artifact revisions instead of leaving you on a broken release. Rollback is opt-in and needs somewhere to keep prior revisions. 
## Enabling it ```yaml spec: rollbackOnFailure: true stages: - name: app sourceRef: name: my-app ``` On a failed run the controller restores each stage's last-good artifact revision, best-effort, and emits a `RolledBack` event. The coordinates it restores from are recorded in `status.lastAppliedSnapshot`. ## The rollback store Rollback needs the prior revision to still be fetchable, so the controller keeps a copy in a **rollback store**. Configure one on the controller (cluster-wide), via either a shared filesystem or S3: ```text # filesystem (an RWX PersistentVolumeClaim) --rollback-store-path=/var/lib/stageset/rollback # or S3-compatible object storage --rollback-store-s3-endpoint=s3.example.com --rollback-store-s3-bucket=stageset-rollback ``` The two are mutually exclusive. With no store configured, rollback can only use a prior revision the producer itself still retains; a dedicated store makes rollback reliable across producer pruning. ### Encryption at rest The store keeps each stage's rendered output, which includes any `Secret`'s data — including [SOPS](https://github.com/getsops/sops)-decrypted values (see [secrets encryption](/usage/encryption/)). Treat it as sensitive and keep it encrypted at rest: - **S3** encrypts by default. `--rollback-store-s3-sse` (chart: `rollbackStore.s3.sse`) is `s3` (SSE-S3) out of the box; set `kms` with `rollbackStore.s3.sseKmsKeyId` for SSE-KMS, or `none` only for a backend that cannot honor an SSE header. A rejected SSE write is non-fatal — it warns via a `RollbackStoreFailed` event and skips the store write; the rollout still succeeds. - **Filesystem** can't encrypt itself — back the PVC with an **encrypted volume** (an encrypted `StorageClass`, LUKS, or cloud-disk encryption). The controller logs a reminder at startup when the file store is enabled. 
If a restore can't proceed because the previous revision is gone, the run fails with the `PreviousRevisionUnavailable` reason (see its [runbook](/runbooks/previousrevisionunavailable/)), and a store problem surfaces as a `RollbackStoreFailed` event.

---

# Secrets encryption (SOPS)

Source: https://stageset.projects.metio.wtf/usage/encryption/

A stage's source can carry [SOPS](https://github.com/getsops/sops)-encrypted files — typically a `Secret` whose values are encrypted — and the controller decrypts them in memory, before building and applying the manifests. This mirrors Flux's `kustomize-controller` decryption contract, so an existing SOPS-encrypted repository works unchanged.

Set `spec.decryption` and point it at a Secret holding the keys:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: payments
  namespace: payments
spec:
  serviceAccountName: payments-deployer
  decryption:
    provider: sops              # the only provider
    secretRef:
      name: sops-age            # a Secret in this namespace holding the age key
  stages:
    - name: app
      sourceRef:
        kind: GitRepository
        name: payments-config   # contains an encrypted secret.yaml
```

## Walkthrough — age

[age](https://age-encryption.org/) is the simplest key type and needs no external service. Take a `Secret` from plaintext to a GitOps-safe rollout in four steps.

**1. Generate an age key.** The file holds the private key; the printed `age1…` line is the public recipient to encrypt to.

```bash
age-keygen -o age.agekey
# public key: age1qz…
```

**2. Encrypt a Secret.** Encrypt only its values, so the file stays a valid Kubernetes object, then commit `secret.enc.yaml` (never the plaintext):

```yaml
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: payments-db
  namespace: payments
stringData:
  password: s3cr3t-do-not-commit-plaintext
```

```bash
sops --encrypt --age age1qz… \
  --encrypted-regex '^(data|stringData)$' \
  secret.yaml > secret.enc.yaml
```

**3. Put the private key in the cluster** under a `.agekey` data entry.
Store `age.agekey` itself somewhere safe — it is the only thing that can decrypt the Secret.

```bash
kubectl create secret generic sops-age \
  --namespace payments \
  --from-file=keys.agekey=age.agekey
```

**4. Decrypt on rollout.** Point a `StageSet` at the source holding `secret.enc.yaml` and set `spec.decryption` (as in the example above). On reconcile the controller fetches the source, decrypts every SOPS file in memory, builds, and applies — so the cluster holds the plaintext `payments-db` Secret while Git only ever held ciphertext. Grant the deployer ServiceAccount read access to the key Secret (see [tenancy](#how-keys-are-read--tenancy) below).

## Pairing with JaaS-rendered manifests

A realistic app renders its config from Jsonnet with [JaaS](https://jaas.projects.metio.wtf/) and keeps only its Secret encrypted. The two compose cleanly because each owns one concern:

- **JaaS renders the non-secret manifests.** It evaluates Jsonnet server-side and cannot hold secret values: SOPS ciphertext carries a MAC over the whole encrypted document, so it can't be authored in Jsonnet — and routing plaintext secrets through a render service is what you are avoiding.
- **The Secret stays SOPS-encrypted in Git**, as in the walkthrough.
- **The controller decrypts and orders both** under one `spec.decryption`:

```yaml
spec:
  serviceAccountName: payments-deployer
  decryption:
    provider: sops
    secretRef:
      name: sops-age
  stages:
    - name: secrets   # decrypt + apply the SOPS Secret first
      sourceRef:
        kind: GitRepository
        name: payments-secrets
    - name: app       # then the JaaS-rendered app that mounts it
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: payments-app
```

The `secrets` stage runs first; only once the `Secret` is applied does the `app` stage roll out the rendered manifests that mount it. The encrypted Secret and the rendered config live in separate sources, so the Jsonnet author never touches secret material.

## The fields

- **`provider`** — the decryption backend.
Only `sops` is supported.
- **`secretRef.name`** — a Secret in the `StageSet`'s namespace holding the keys, using the SOPS conventions: age private keys under data entries ending in `.agekey`, armored PGP private keys under `.asc`. Optional — omit it for a [cloud-KMS-only](#cloud-kms) setup.

## How keys are read — tenancy

The key Secret is read in the `StageSet`'s namespace **under its `serviceAccountName`**, exactly like the manifests it applies. A tenant can only decrypt with key material its own ServiceAccount is allowed to read, so a key in one namespace is never reachable from another. Grant the deployer SA `get` on the key Secret:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-deployer-sops
  namespace: payments
rules:
  - apiGroups: [""]
    resources: [secrets]
    resourceNames: [sops-age]
    verbs: [get]
```

In a [single-tenant cluster-admin](/usage/multi-cluster/#single-tenant-cluster-admin) install (no `serviceAccountName`), the controller reads the key Secret under its own identity instead.

## Decryption and the rollback store

Decrypted bytes exist only in memory on the apply path. The one place rendered output is persisted is the optional [rollback store](/usage/rollback/), which is **encrypted at rest** (S3 SSE by default; an encrypted volume for the file store) — so a decrypted `Secret` never lands in plaintext on disk. See [encryption at rest](/usage/rollback/#encryption-at-rest).

A rollback re-fetches the previous source and **runs decryption again** rather than restoring plaintext, so the key Secret must still exist when a rollback fires. If the key was rotated or deleted in the meantime, the rollback **fails closed** with `PreviousRevisionUnavailable` instead of applying a stale or unreadable Secret — an encrypted store cannot avoid this, and it is the safe failure direction.
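Because a rollback decrypts again under the tenant ServiceAccount, the `get` grant on the key Secret has to stay in place for as long as a rollback might fire. In Kubernetes RBAC a Role takes effect only once it is bound, so a RoleBinding along the following lines is presumably needed next to the Role from the tenancy section. The binding's name is hypothetical; the Role, namespace, and ServiceAccount names reuse the examples on this page.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-deployer-sops   # hypothetical name for the binding
  namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: payments-deployer-sops   # the Role granting get on the sops-age Secret
subjects:
  - kind: ServiceAccount
    name: payments-deployer      # matches spec.serviceAccountName on the StageSet
    namespace: payments
```

Keeping the Role, the binding, and the key Secret in the same namespace as the `StageSet` preserves the per-tenant isolation described above.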
## Cloud KMS

SOPS files encrypted with a cloud KMS key (AWS KMS, GCP KMS, Azure Key Vault, or HashiCorp Vault) decrypt through the **controller's ambient credentials** — e.g. an IRSA role on EKS, wired via `serviceAccount.annotations`. No in-cluster key Secret is needed, so `secretRef` may be omitted for a KMS-only `StageSet`:

```yaml
spec:
  decryption:
    provider: sops
    # secretRef omitted; KMS uses the controller's identity
```

One consequence to weigh in a multi-tenant cluster: unlike age (read under the tenant SA), **cloud KMS uses the controller's identity**, so any `StageSet` can decrypt a file encrypted with a KMS key the controller's role can access. This matches Flux's `kustomize-controller`. Scope the controller's KMS grant accordingly, or use age keys for hard per-tenant isolation.

## What's supported

- **age** keys via `secretRef` — read under the tenant SA. The resource-level pattern (`--encrypted-regex '^(data|stringData)$'`) is the tested path.
- **PGP** keys via `secretRef` (`.asc` entries) — read under the tenant SA, pure Go, no `gpg` binary or keyring needed. See [PGP keys](#pgp-keys).
- **Cloud KMS** (AWS/GCP/Azure/Vault) via the controller's ambient credentials.
- **Encrypted files feeding a `secretGenerator`** — an encrypted `.env` (or other file) referenced by a kustomize `secretGenerator` is decrypted before the build, so the generated `Secret` carries the plaintext.
- A file with no SOPS metadata passes through untouched, so encrypted and plain manifests can sit side by side in one source.

## PGP keys

PGP works **tenant-scoped**, like age: put one or more armored private keys in the `secretRef` Secret under data entries suffixed `.asc`. The data key is decrypted in pure Go (`ProtonMail/go-crypto`) directly from those keys — **no `gpg` binary, no GnuPG keyring, and no `GNUPGHOME`** — and the keys are read under the `StageSet`'s `serviceAccountName`, so a tenant can only use material its ServiceAccount can read.
```bash # export the armored private key and load it into the key Secret gpg --export-secret-keys --armor 0xYOURFINGERPRINT > key.asc kubectl create secret generic sops-keys \ --namespace payments \ --from-file=pgp.asc=key.asc ``` ```yaml spec: decryption: provider: sops secretRef: name: sops-keys # holds the *.asc private key(s) ``` One Secret can carry both age (`*.agekey`) and PGP (`*.asc`) keys; the right one is used per file. For a fresh setup, age is simpler and the recommended default, but an existing PGP-encrypted repository needs no migration. --- # Stages and sources Source: https://stageset.projects.metio.wtf/usage/stages-and-sources/ A `StageSet` is an ordered list of stages. Each stage resolves a [Flux](https://fluxcd.io/) source — a `GitRepository`, `OCIRepository`, `Bucket`, or an `ExternalArtifact` (the default) — applies its manifests, waits for them to become healthy, and only then lets the next stage start. ## One stage The minimum is one stage pointing at one artifact in the same namespace: ```yaml apiVersion: stages.metio.wtf/v1 kind: StageSet metadata: name: my-app namespace: default spec: stages: - name: app sourceRef: name: my-app # an ExternalArtifact ``` `sourceRef.kind` defaults to `ExternalArtifact`, so the common case is a single line. The controller fetches the artifact, applies every manifest in it, and marks the stage `Ready` once the applied objects report healthy. ## Source kinds A `sourceRef` resolves to a Flux artifact three ways. Point it at whichever you already have: ```yaml # 1. an ExternalArtifact (the default — kind omitted) sourceRef: name: my-app # 2. a classic Flux source, consumed directly sourceRef: kind: GitRepository # or OCIRepository, or Bucket name: my-app-manifests # 3. 
a producer that publishes an ExternalArtifact (resolved via its back-pointer) sourceRef: apiVersion: jaas.metio.wtf/v1 kind: JsonnetSnippet name: my-app ``` `GitRepository`, `OCIRepository`, and `Bucket` carry the same `status.artifact` contract as `ExternalArtifact`, so the controller reads them directly — no producer in between. A stage can apply manifests straight from a Git repo or an OCI artifact, like Flux's own `kustomize-controller`. For the producer case (for example rendering Jsonnet with [JaaS](https://jaas.projects.metio.wtf/)), see [producer-aware sources](/usage/producer-aware-sources/). ## Ordered stages Add more stages and they run top to bottom — each one waits for the previous to be `Ready`: ```yaml spec: stages: - name: crds # 1 ── install the CRDs first sourceRef: name: platform-crds - name: operator # 2 ── then the operator that needs them sourceRef: name: platform-operator - name: workloads # 3 ── then the workloads it manages sourceRef: name: team-workloads ``` This is the core of a `StageSet`: `operator` is never applied until `crds` is healthy, so the operator never crash-loops waiting for a CRD that isn't there yet. ## Shaping a stage's manifests A stage can build from a sub-path of the artifact, customize with patches, and substitute variables — the [kustomize](https://kubectl.docs.kubernetes.io/)-style surface: ```yaml spec: stages: - name: app sourceRef: name: my-app path: ./overlays/production # build a sub-path of the artifact prune: true # GC objects that leave this stage (default) patches: - patch: | - op: replace path: /spec/replicas value: 6 target: kind: Deployment name: web postBuild: substitute: cluster_name: prod-eu substituteFrom: - kind: ConfigMap name: cluster-vars - kind: Secret name: cluster-secrets optional: true ``` - **`path`** builds from a directory inside the artifact (default `./`). 
- **`prune`** (default `true`) garbage-collects objects that fall out of the stage between reconciles, tracked precisely via the stage's [`StageInventory`](/api/stageinventory/). - **`patches`** are strategic-merge or JSON6902 patches applied after the build. - **`postBuild`** substitutes `${var}` references from inline values, ConfigMaps, and Secrets at delivery time — see [parameterizing a rollout](/tutorials/parameters/) for the full render-time-vs-delivery-time treatment. From here, layer on [actions](/usage/actions/) to gate the stage, or [ready checks](/usage/ready-checks/) to define what "healthy" means. --- # Update windows Source: https://stageset.projects.metio.wtf/usage/update-windows/ Update windows gate *when* new artifact revisions roll out, without pausing reconciliation. Drift correction keeps running; only the rollout of a *new* revision is held until a window allows it. ## Deny a recurring window Freeze rollouts during business hours: ```yaml spec: stages: - name: app sourceRef: name: my-app updateWindows: - type: Deny schedule: "0 9 * * MON-FRI" # 5-field cron: start of the window duration: 8h timeZone: Europe/Berlin ``` A new revision that arrives inside the window is held; `status.pendingUpdate` records what is waiting and `nextWindowOpens` when it will ship. The controller emits an `UpdateDeferred` event and increments `stageset_update_deferred_total`. ## Allow-list windows If any `Allow` window exists, rollouts happen **only** inside an active Allow with no active Deny — `Deny` always wins. 
This expresses "only deploy on Tuesday and Thursday afternoons": ```yaml updateWindows: - type: Allow schedule: "0 14 * * TUE,THU" duration: 3h timeZone: America/New_York ``` ## A one-off freeze Absolute windows use `from`/`to` instead of a schedule — for a planned event freeze: ```yaml updateWindows: - type: Deny from: 2026-12-24T00:00:00Z to: 2026-12-27T00:00:00Z ``` ## What a closed window blocks `windowScope` controls what a closed window holds back: - **`Updates`** (default) — hold only the rollout of a *new* artifact revision. Drift correction keeps re-applying the pinned state, so the live cluster stays on its last-approved revision but doesn't fall out of sync. - **`All`** — a hard freeze: also pause drift correction, so the controller applies nothing at all while the window is closed. ```yaml windowScope: Updates # default: hold new revisions, keep correcting drift # windowScope: All # hard freeze: also pause drift correction ``` ## Shipping anyway To push a held rollout through immediately, override the window with [`stagesetctl`](/cli/): ```shell stagesetctl reconcile my-app --update-now ``` This stamps the `stages.metio.wtf/update-now` annotation; the honored value is recorded in `status.lastHandledUpdateOverride`. --- # Versioned migrations Source: https://stageset.projects.metio.wtf/usage/versioned-migrations/ Some changes only need to happen once, when you cross a release boundary — a one-time data backfill on the way to 2.0, a schema conversion between 1.x and 2.x. Versioned migrations run a ladder of [actions](/usage/actions/) exactly when the deployed version steps over the boundary, and never again. Versioning is off until you set `spec.version`. ## Declaring the version The controller needs to know *what version is currently being deployed*. There are three ways to declare it; pick by **where the version lives**. 
| Source | The version lives… | Best for | |---|---|---| | [`version.value`](#inline--versionvalue) | on the `StageSet` | environment-pinned versions, quick starts | | [`version.fromObject`](#from-a-rendered-object--versionfromobject) | inside the manifests | **any source, including JaaS** — the recommended default | | [`version.fromArtifact`](#from-a-file-in-the-artifact--versionfromartifact) | a file in the artifact | Git/OCI/Bucket sources that can ship a `VERSION` file | Whichever you choose, the resolved value is trimmed and parsed as semver (a leading `v` is accepted). A missing stage/object/file, an empty value, or an unparseable one fails terminally with the `InvalidVersion` reason (see its [runbook](/runbooks/invalidversion/)) — a half-versioned system is worse than an unversioned one. ### Inline — `version.value` The `StageSet` author pins the version directly. Use this when the version is a property of the environment rather than of the content, or to get started quickly: ```yaml spec: version: value: "2.1.0" # bump this when you cut a release ``` The trade-off: the version is declared here, not carried by the content, so you bump it by editing the `StageSet`. ### From a rendered object — `version.fromObject` The recommended way to let the version travel with the content. [Kubernetes](https://kubernetes.io/docs/) has a standard place for an application's version: the `app.kubernetes.io/version` label. Well-formed manifests set it, so the version is already inside the manifests — `fromObject` reads it back. This works for every source kind, including a single-document renderer like [JaaS](https://jaas.projects.metio.wtf/) that has no room for a separate file. 
```yaml spec: version: fromObject: stage: app # which stage's rendered manifests carry it kind: Deployment # the object to read name: web # fieldPath omitted → reads metadata.labels['app.kubernetes.io/version'] stages: - name: app sourceRef: name: my-app ``` The controller builds the `app` stage's manifests (the same render it applies), finds the `Deployment/web` object, and reads its `app.kubernetes.io/version` label. Because the label is part of the manifests, the version changes in lockstep with the content — no second file to keep in sync. **Reading a different field.** Set `fieldPath` to a kubectl-style JSONPath that resolves to the bare version string. (It must be the version *only*; a JSONPath can't split an `image: web:2.1.0` value, so prefer the label.) `apiVersion` is optional and narrows the match when a `Kind`+`Name` pair would be ambiguous: ```yaml spec: version: fromObject: stage: app apiVersion: v1 kind: ConfigMap name: app-meta fieldPath: "{.data.version}" # must resolve to a bare semver, e.g. 2.1.0 ``` This is the path the [Jsonnet-to-rollout tutorial](/tutorials/jsonnet-to-rollout/) uses: the snippet renders the version into the manifest's version label, and the StageSet reads it straight back. ### From a file in the artifact — `version.fromArtifact` The version travels with the content as a **dedicated file** containing a single semver. This fits **Git/OCI/Bucket** sources, where you can ship an extra file beside the manifests. (It does *not* fit JaaS `rendered` output, which is a single `rendered.json`; use `fromObject` there.) **Who writes it, and where:** the artifact's producer. For a Git source, commit a `VERSION` file in the repo; for an OCI/Bucket artifact, include it in the pushed tree. 
The file lives at `path` inside the named stage's artifact, relative to the artifact root: ```text # VERSION — committed alongside the manifests it versions 2.1.0 ``` ```yaml spec: version: fromArtifact: stage: app # which stage's artifact carries the file path: VERSION # the file's path inside that artifact (cleaned; no leading ./) stages: - name: app sourceRef: kind: GitRepository name: my-app ``` The controller fetches the `app` stage's artifact and reads the file at `path`. ## Declaring migrations Each migration names the boundary it crosses (`to`, optionally `from`), the stage it anchors before, and the actions to run: ```yaml spec: version: fromArtifact: stage: app path: VERSION migrations: - name: backfill-ledger-2-0 from: "1.*" # optional: only when coming from a 1.x to: "2.0.0" # the boundary this migration crosses stage: app # runs before this stage's pre-actions actions: - name: backfill job: sourceRef: name: ledger-backfill-job stages: - name: app sourceRef: name: my-app ``` When the deployed version crosses from a `1.x` into `2.0.0`, the `backfill` job runs once, anchored before the `app` stage. The controller tracks progress so a retry doesn't re-run a completed migration: - `status.version` — the deployed version, written only after a fully successful run. - `status.pendingMigrations` — migrations the next run will execute. - `status.executedMigrations` — the in-flight ledger for the current transition. Migrations emit `MigrationStarted` / `MigrationCompleted` events (and `MigrationFailed` on error). A downgrade that would skip a required migration is refused with the `DowngradeRequiresMigration` reason — see its [runbook](/runbooks/downgraderequiresmigration/). --- # CLI Source: https://stageset.projects.metio.wtf/cli/ `stagesetctl` previews, renders, and drives StageSets without waiting for the next reconcile. It speaks to the cluster with your own kubeconfig — nothing about it runs in-cluster. 
Installed on your `PATH` as `kubectl-stageset`, it also works as a kubectl plugin: `kubectl stageset ` is equivalent to `stagesetctl `. | Command | Purpose | |---|---| | [`get`](/cli/get/) | Print a StageSet's status, or list StageSets. | | [`build`](/cli/build/) | Render a StageSet's manifests to stdout. | | [`diff`](/cli/diff/) | Preview what a reconcile would change; usable as a CI gate. | | [`reconcile`](/cli/reconcile/) | Force an out-of-band reconcile. | ## Global flags Every command accepts the standard kubectl connection flags (`genericclioptions.ConfigFlags`): `--kubeconfig`, `--context`, `-n/--namespace`, `--as`, `--as-group`, `--server`, `--token`, `--request-timeout`, and the rest. `--version` prints the binary version and commit (` (commit )`). With no `-n/--namespace`, the command uses the namespace from your current kubeconfig context, falling back to `default`. ## Exit codes Every command shares the same baseline: | Code | Meaning | |---|---| | `0` | Success. | | `2` | Usage or flag error. | | `3` | Runtime error. | [`diff`](/cli/diff/) adds one more: it exits `1` when it finds changes (the `diff(1)` convention), so it can gate a CI pipeline. --- # stagesetctl build Source: https://stageset.projects.metio.wtf/cli/build/ Runs the same resolve → fetch → build pipeline the controller uses and writes the result — a multi-document YAML stream — to stdout. This is what would be applied, before it is applied. To preview the change against live cluster state instead, use [`diff`](/cli/diff/). ```text stagesetctl build NAME [flags] ``` | Flag | Default | Description | |---|---|---| | `--stage` | _(all)_ | Render only the named stage(s); repeatable. | | `--source-dir` | _(none)_ | Use a local artifact tree as `[STAGE=]PATH` instead of fetching from the cluster; repeatable. | | `--show-secrets` | `false` | Reveal Secret values instead of masking them. 
| | `--as-tenant` | `false` | Render impersonating the StageSet's `spec.serviceAccountName` (see [multi-cluster and tenancy](/usage/multi-cluster/)). | Secret values are masked by default, so the output is safe to paste into a review. `build` writes YAML unconditionally — there is no output-format flag. ## Example ```shell stagesetctl build payments --stage application ``` ```yaml --- apiVersion: apps/v1 kind: Deployment metadata: name: web namespace: payments spec: replicas: 6 selector: matchLabels: {app: web} template: metadata: labels: {app: web} spec: containers: - name: web image: registry.internal/web:2.1.0 --- apiVersion: v1 kind: Secret metadata: name: web-config namespace: payments type: Opaque data: token: '***' # masked; pass --show-secrets to reveal ``` `--source-dir` makes `build` work offline — point it at the directory an artifact would have unpacked to and it skips the cluster fetch, for authoring and CI. The value is `[STAGE=]PATH`: prefix a stage name to target one stage, or give a bare path to feed every stage that has no entry of its own. Repeat the flag to map each stage to its own tree: ```shell # one stage from a local tree stagesetctl build payments --stage application --source-dir application=./out # every stage from one tree (bare path), overriding just infrastructure stagesetctl build payments \ --source-dir ./checkout \ --source-dir infrastructure=./infra-checkout ``` --- # stagesetctl diff Source: https://stageset.projects.metio.wtf/cli/diff/ By default `diff` performs a [server-side](https://kubernetes.io/docs/reference/using-api/server-side-apply/) dry-run apply and exits `1` when there are changes, so it works as a CI gate. It shows, per object, what a reconcile would create, configure, or delete, plus the [actions](/usage/actions/) a rollout would run. To see the full rendered manifests without comparing against the cluster, use [`build`](/cli/build/). 
```text stagesetctl diff NAME [flags] ``` | Flag | Default | Description | |---|---|---| | `--stage` | _(all)_ | Diff only the named stage(s); repeatable. | | `--source-dir` | _(none)_ | Use a local artifact tree as `[STAGE=]PATH`; repeatable. Skips the cluster fetch. | | `--server-side` | `true` | Server-side dry-run apply diff (needs update/patch RBAC). `false` renders client-side against live objects. | | `--as-tenant` | `false` | Render and dry-run impersonating `spec.serviceAccountName` (see [multi-cluster and tenancy](/usage/multi-cluster/)). | | `--show-secrets` | `false` | Reveal Secret values instead of masking. | | `--show-unchanged` | `false` | Include objects with no change. | | `--prune` | `true` | Show resources that would be deleted (fell out of inventory). | | `--color` | `auto` | Colorize output: `auto`, `always`, or `never`. | | `--exit-code` | `true` | Exit `1` when changes are found. `false` always exits `0` on a clean run. | ## Example ```shell stagesetctl diff payments ``` ```text --- live +++ merged @@ Deployment payments/web @@ spec: - replicas: 3 + replicas: 6 - ConfigMap payments/old-feature-flags (pruned: fell out of inventory) Actions to run: application: pre db-migrate job ledger-migrations post smoke-test http https://payments.internal/healthz ``` Objects that left the stage's [inventory](/api/stageinventory/) show as deletions (`pruned: …`); pass `--prune=false` to hide them. The trailing `Actions to run` block lists the [pre/post/onFailure actions](/usage/actions/) a real reconcile would execute — `diff` never runs them, it only reports them. A clean run prints nothing and exits `0`; pending changes exit `1`. To inspect without failing the shell: ```shell stagesetctl diff payments --color=never --exit-code=false ``` Use `--server-side=false` when you lack apply RBAC and only need a textual render-versus-live comparison. 
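As a rough illustration of the CI-gate use, the exit codes can be told apart in a pipeline step. The workflow below is a hypothetical GitHub Actions sketch, not something shipped with the project: it assumes `stagesetctl` is already on the runner's `PATH` and that a kubeconfig with access to the cluster is configured, and it reuses the `payments` StageSet from the example above.

```yaml
# .github/workflows/stageset-diff.yaml (hypothetical file name and job layout)
name: stageset-diff
on: pull_request
jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Preview the rollout
        run: |
          # Assumption: stagesetctl is installed and cluster credentials are set up.
          set +e
          stagesetctl diff payments --namespace payments --color=never
          code=$?
          set -e
          case "$code" in
            0) echo "no pending changes" ;;
            1) echo "changes pending; failing the gate"; exit 1 ;;
            *) echo "diff did not complete (exit code $code)"; exit "$code" ;;
          esac
```

Separating exit code `1` from `2` and `3` keeps "the rollout would change something" distinct from "the diff itself broke", which usually call for different reactions in review.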
--- # stagesetctl get Source: https://stageset.projects.metio.wtf/cli/get/ With no `NAME`, lists StageSets in the current namespace. With a `NAME`, prints that StageSet's detail (Ready reason, per-stage phase, revisions, version) — a readable view of [`StageSet.status`](/api/stageset/#status). ```text stagesetctl get [NAME] [flags] ``` | Flag | Default | Description | |---|---|---| | `-A`, `--all-namespaces` | `false` | List StageSets across all namespaces. | | `-o`, `--output` | _(table)_ | Output format: empty for the human table, or `yaml` / `json`. | ## Listing ```shell stagesetctl get -A ``` ```text NAMESPACE NAME READY REASON STAGES VERSION PENDING payments payments True Succeeded 2/2 2.1.0 - platform platform True Succeeded 3/3 - - staging web False StageFailed 1/2 - - ``` `STAGES` is `ready/total`; `PENDING` shows `held until