# StageSet Controller — full documentation

> The complete StageSet Controller documentation (https://stageset.projects.metio.wtf/) concatenated for
> LLMs. For a concise link index see https://stageset.projects.metio.wtf/llms.txt.


# StageSet Controller

`stageset-controller` is a [Flux](https://fluxcd.io/) controller for ordered, gated, multi-stage delivery.

Flux's `kustomize-controller` and `helm-controller` apply an artifact in one
shot. That fits most releases, but not one that has to happen in sequence:
install the CRDs before the operator that needs them, run a database migration
before the app that reads the new schema, hold a production rollout until the
canary is healthy, freeze changes during business hours.

A `StageSet` describes a release as an ordered list of stages. Each stage applies
a Flux source — a `GitRepository`, `OCIRepository`, `Bucket`, or an
[`ExternalArtifact`](https://fluxcd.io/flux/components/source/externalartifacts/)
(including one rendered on the fly by a producer like [JaaS](https://jaas.projects.metio.wtf/)) —
waits for it to become healthy, and only then lets the next stage begin. Between
stages you can run typed actions (a migration `Job`, an HTTP gate, a wait-for-condition),
gate rollouts behind [update windows](/usage/update-windows/), and run
version-aware [migrations](/usage/versioned-migrations/) when you cross a release boundary.
Everything is reconciled continuously, drift-corrected, and pruned with ApplySet
semantics.

## What a StageSet looks like

The smallest useful StageSet is one stage pointing at one artifact:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: my-app
  namespace: default
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app          # an ExternalArtifact in this namespace
```

The same shape scales up to a gated rollout:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: payments
  namespace: payments
spec:
  serviceAccountName: payments-deployer     # every apply is impersonated as this SA

  stages:
    # 1 ── shared infrastructure: CRDs, namespaces, RBAC
    - name: infrastructure
      sourceRef:
        name: payments-infra                # an ExternalArtifact
      readyChecks:
        checks:
          - apiVersion: apiextensions.k8s.io/v1
            kind: CustomResourceDefinition
            name: ledgers.payments.example

    # 2 ── the application, started only once infrastructure is Ready
    - name: application
      sourceRef:
        name: payments-app
      actions:
        pre:
          - name: db-migrate                # runs before the manifests are applied
            job:
              sourceRef:
                name: payments-migrations
        post:
          - name: smoke-test                # stage is Ready only if this passes
            http:
              url: https://payments.internal/healthz
              expectedStatus: [200]

  # new revisions roll out only outside the weekend change freeze, which opens
  # Fridays at 17:00 and lasts 60h (until about Monday 05:00, Europe/Berlin)
  updateWindows:
    - type: Deny
      schedule: "0 17 * * FRI"
      duration: 60h
      timeZone: Europe/Berlin
```

Stages run top to bottom. `infrastructure` must report Ready (its CRD established)
before `application` is touched; the migration Job runs before the app is applied;
the rollout is held while the change-freeze window is open. Everything is
continuously reconciled — drift is corrected, removed objects are pruned.

## Where to go next

- **[Installation](/installation/)** — install on Kubernetes, then harden for
  production and wire up observability.
- **[Usage](/usage/)** — worked examples for every feature, from a single stage
  to versioned migrations.
- **[CLI](/cli/)** — `stagesetctl` for previewing (`diff`), rendering (`build`),
  and driving (`reconcile`) StageSets.
- **[API reference](/api/)** — every field of every custom resource, explained.
- **[Comparisons](/comparisons/)** — how StageSet relates to Helm, Kustomize,
  Tanka, kubecfg, and plain Flux.
- **[Runbooks](/runbooks/)** — symptom → cause → remediation for every status
  reason.

## Related projects

`stageset-controller` handles the delivery end and composes with two adjacent
projects, each useful on its own:

- **[JOI](https://github.com/metio/jsonnet-oci-images)** publishes Jsonnet
  libraries as single-layer OCI images.
- **[JaaS](https://jaas.projects.metio.wtf/)** evaluates Jsonnet on demand and
  publishes the result as a Flux `ExternalArtifact`.
- `stageset-controller` takes those artifacts and rolls them out, in order, with
  gates.

JOI and JaaS are not required — a stage reads straight from a `GitRepository`,
`OCIRepository`, or `Bucket`, or from any `ExternalArtifact`, whatever produced
it.


---

# Installation

Source: https://stageset.projects.metio.wtf/installation/


Get `stageset-controller` running on a [Kubernetes](https://kubernetes.io/docs/)
cluster, then keep it healthy in [production](/installation/production/).


---

# Configuration reference

Source: https://stageset.projects.metio.wtf/installation/configuration/


The controller is configured entirely through command-line flags, grouped below
by subsystem. When deployed via the Helm chart you never pass these directly — the
chart sets them from your values and its own defaults; each section notes the Helm
value that drives a flag, and the
[metio/helm-charts](https://github.com/metio/helm-charts/tree/main/charts/stageset-controller)
repo carries the full values reference. For the Helm values worth tuning and the
reasoning behind each, see [Production](/installation/production/#settings-you-can-tune);
for metrics and runbooks, [Operations](/installation/operations/).

## Manager and leader election

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--health-probe-bind-address` | `:8081` | Address the liveness and readiness probe endpoints bind to. | _chart-managed_ |
| `--leader-elect` | `false` | Enable controller-runtime leader election so only one replica reconciles at a time. Required when running more than one replica. | `controller.leaderElect` |

The leader-election lease name is fixed at `stageset-controller.stages.metio.wtf`
and is created in the namespace the controller pod runs in.

## Watch scope

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--watch-namespaces` | _(empty)_ | Comma-separated list of namespaces the controller watches. Empty (the default) means cluster-wide. When set, the manager's cache only observes StageSets and sources in these namespaces — the multi-tenant controller-instances pattern. Falls back to the `STAGESET_WATCH_NAMESPACES` environment variable when the flag is empty. | `controller.watchNamespaces` |

**Environment variable:** `STAGESET_WATCH_NAMESPACES` — comma-separated
namespace list. When `--watch-namespaces` is non-empty the flag takes
precedence. When restricted, the chart pivots RBAC to per-namespace
RoleBindings instead of a cluster-wide ClusterRoleBinding.
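
As a minimal sketch, restricting one controller instance to two tenant namespaces through the chart could look like the following. The namespace names are hypothetical, and the list shape is inferred from the value's `[]` default rather than from the chart's schema.

```yaml
# values.yaml: a sketch; team-a and team-b are placeholder namespaces
controller:
  watchNamespaces:     # assumed to take a list of namespace names
    - team-a
    - team-b
```

With the value set, the chart drives `--watch-namespaces` and renders per-namespace RoleBindings instead of a ClusterRoleBinding.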

## Reconciliation defaults

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--default-interval` | `10m` | Reconcile cadence for StageSets that omit `spec.interval`. | `controller.defaultInterval` |
| `--inventory-mode` | `hybrid` | Inventory strategy for tracking applied resources: `entries`, `hybrid`, or `applyset`. | `controller.inventoryMode` |
| `--inventory-shard-cap` | `5000` | Maximum number of resource entries per `StageInventory` shard. | `controller.inventoryShardCap` |
| `--no-cross-namespace-refs` | `false` | Deny `sourceRef` and `dependsOn` references that target a different namespace. | `controller.noCrossNamespaceRefs` |
| `--allowed-action-hosts` | _(empty)_ | Host glob allowed for `http` actions; repeatable. Loopback and link-local ranges are always denied unless explicitly listed. | `controller.allowedActionHosts` |
| `--runbook-base-url` | _(empty)_ | URL prefix appended to actionable Ready condition messages as `(runbook: <base>/<reason>/)`. Empty disables. | `controller.runbookBaseURL` |
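
A values fragment tying several of these flags together might look like the sketch below. The hosts are illustrative: `payments.internal` echoes the smoke-test URL from the earlier example, `*.example.internal` is a hypothetical glob, and the exact glob syntax the controller accepts is not specified here.

```yaml
# values.yaml: illustrative only, check the chart's values reference
controller:
  defaultInterval: 5m            # StageSets without spec.interval reconcile every 5 minutes
  noCrossNamespaceRefs: true     # deny sourceRef/dependsOn into other namespaces
  allowedActionHosts:            # assumed to take a list, matching the repeatable flag
    - payments.internal
    - "*.example.internal"       # hypothetical glob
```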

## Rollback store — filesystem

The rollback store preserves a copy of each stage's last-applied artifact so
that a rollback can re-apply the previous revision without re-fetching from the
producer. The filesystem backend is appropriate for single-replica deployments or
multi-replica deployments backed by an `RWX` volume.

`--rollback-store-path` and `--rollback-store-s3-endpoint` are mutually
exclusive. Leaving both empty disables the store; rollback then falls back to
re-fetching the producer artifact.

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--rollback-store-path` | _(empty)_ | Filesystem directory (e.g. an RWX PVC mount) for the rollback store. Empty disables the filesystem backend. | `rollbackStore.backend: pvc` |

The file store writes rendered output — including Secret data — in the clear.
The volume must provide encryption at rest (encrypted StorageClass, LUKS, or
cloud-disk encryption).

## Rollback store — S3

Active when `--rollback-store-s3-endpoint` and `--rollback-store-s3-bucket` are
both non-empty.

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--rollback-store-s3-endpoint` | _(empty)_ | S3-compatible endpoint (`host:port`, e.g. `s3.amazonaws.com` or `minio.minio.svc:9000`). Empty disables the S3 backend. | `rollbackStore.s3.endpoint` |
| `--rollback-store-s3-bucket` | _(empty)_ | S3 bucket for the rollback store. Must already exist. | `rollbackStore.s3.bucket` |
| `--rollback-store-s3-prefix` | _(empty)_ | Optional object-key prefix so the rollback store can coexist with other tenants in one bucket. | `rollbackStore.s3.prefix` |
| `--rollback-store-s3-region` | _(empty)_ | S3 region. Required for AWS multi-region buckets; ignored by most S3-compatible servers. | `rollbackStore.s3.region` |
| `--rollback-store-s3-use-ssl` | `true` | Use HTTPS to talk to the S3 endpoint. Set to `false` only for local MinIO over plain HTTP. | `rollbackStore.s3.useSSL` |
| `--rollback-store-s3-access-key` | _(empty)_ | Static access key. Empty engages minio-go's IAM/IRSA credential discovery chain (env → web-identity → EC2/EKS metadata). | `rollbackStore.s3.existingSecret` |
| `--rollback-store-s3-secret-key` | _(empty)_ | Secret key, paired with `--rollback-store-s3-access-key`. | `rollbackStore.s3.existingSecret` |
| `--rollback-store-s3-session-token` | _(empty)_ | Optional session token for temporary credentials (e.g. IRSA). | `rollbackStore.s3.existingSecret` |
| `--rollback-store-s3-anonymous` | `false` | Skip request signing. For public buckets only. | `rollbackStore.s3.anonymous` |
| `--rollback-store-s3-sse` | `s3` | Server-side encryption for stored objects: `none`, `s3` (SSE-S3), or `kms` (SSE-KMS). The store holds rendered Secret data, so encryption is on by default. Set `none` only for a bucket whose backend cannot honor an SSE header. | `rollbackStore.s3.sse` |
| `--rollback-store-s3-sse-kms-key` | _(empty)_ | KMS key ARN or ID for `--rollback-store-s3-sse=kms`. Empty uses the bucket's default KMS key. | `rollbackStore.s3.sseKmsKeyId` |
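
For illustration, a values fragment selecting SSE-KMS with a dedicated key could look like the sketch below. The endpoint, bucket name, prefix, region, and key ARN are placeholders; only the value names come from the table above.

```yaml
# values.yaml: placeholders throughout
rollbackStore:
  backend: s3
  s3:
    endpoint: s3.eu-central-1.amazonaws.com
    bucket: example-stageset-rollback      # must already exist
    prefix: cluster-a/                     # hypothetical object-key prefix
    region: eu-central-1
    sse: kms                               # SSE-KMS instead of the default SSE-S3
    sseKmsKeyId: arn:aws:kms:eu-central-1:123456789012:key/00000000-0000-0000-0000-000000000000
```

Leaving `sseKmsKeyId` empty falls back to the bucket's default KMS key.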

## Metrics and health

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--metrics-bind-address` | `:8080` | Address the controller-runtime Prometheus metrics endpoint binds to. `"0"` disables. | _chart-managed_ |

The metrics endpoint exposes standard `controller_runtime_*` and `workqueue_*`
series alongside the custom `stageset_*` metrics documented in
[Operations](/installation/operations/).

## Webhook and TLS provisioning

The validating admission webhook for `StageSet` is enabled by default. Two TLS
provisioning modes are supported.

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--enable-webhook` | `true` | Enable the validating admission webhook for `StageSet`. | _chart-managed_ |
| `--webhook-cert-mode` | `cert-manager` | TLS provisioning mode: `cert-manager` (chart renders a `Certificate` CR; cert is mounted from a Secret) or `self-signed` (the controller generates a CA and serving cert in-pod and patches the `ValidatingWebhookConfiguration` `caBundle`). | `webhook.certMode` |
| `--webhook-cert-dir` | `/tmp/k8s-webhook-server/serving-certs` | Directory holding `tls.crt` and `tls.key` for the webhook server. | _chart-managed_ |
| `--webhook-port` | `9443` | Port the validating webhook server binds to. | _chart-managed_ |
| `--webhook-cert-validity` | `8760h` (1 year) | Validity of the self-signed serving cert. The controller rotates it every `validity/3` (`2920h`, about 122 days, at the default). | `webhook.*` |
| `--webhook-service-name` | `stageset-controller-webhook` | Kubernetes Service the webhook is reachable through. Used to build cert SANs in `self-signed` mode. | _chart-managed_ |
| `--webhook-service-namespace` | _(empty)_ | Namespace of the webhook Service. Empty falls back to the in-cluster ServiceAccount namespace. | _chart-managed_ |
| `--webhook-validating-config-name` | _(empty)_ | Name of the `ValidatingWebhookConfiguration` whose `caBundle` the controller patches. Required when `--webhook-cert-mode=self-signed`. | _chart-managed_ |

## Gate endpoint

The gate endpoint exposes a read-only HTTP API for Flagger canary stage-gates.
`GET /gate/{namespace}/{stageset}/{stage}` returns `200` when the named stage is
ready to advance and `503` otherwise.

| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--gate-bind-address` | `:8082` | Address for the Flagger stage-gate endpoint. Empty disables the endpoint. | `gate.enabled` |

## Logging

Logging is powered by the controller-runtime `zap` logger. The standard zap
flags (`--zap-log-level`, `--zap-encoder`, `--zap-stacktrace-level`,
`--zap-time-encoding`, and `--zap-devel`) are available and bound to
`flag.CommandLine`; run `stageset-controller --help` to see their current
defaults.


---

# Install on Kubernetes

Source: https://stageset.projects.metio.wtf/installation/kubernetes/


## Prerequisites

- A [Kubernetes](https://kubernetes.io/docs/) cluster with `kubectl` and
  [`helm`](https://helm.sh/) configured against it.
- [Flux](https://fluxcd.io/) `source-controller`, specifically the
  `ExternalArtifact` API (`source.toolkit.fluxcd.io`). A `StageSet` stage always
  resolves to an `ExternalArtifact`, so the CRD must exist. `ExternalArtifact`
  lands in Flux **v2.7.0**; install at least that version. The controller also
  watches `GitRepository`, `OCIRepository`, and `Bucket` sources for
  producer-aware resolution.
- [cert-manager](https://cert-manager.io/), only if you choose the
  `cert-manager` webhook certificate mode. The chart defaults to `self-signed`,
  which provisions and rotates the admission webhook's TLS in-process and needs
  no cert-manager. See [production](/installation/production/#admission-webhook-tls)
  for the trade-off.

Neither [JaaS](https://jaas.projects.metio.wtf/), JOI, nor any other artifact
producer is required to install the controller; those are sources of
`ExternalArtifact`s, wired up per `StageSet`.

## Install with Helm

The controller is distributed as an OCI [Helm](https://helm.sh/) chart. The
deployment manifests live in the chart, not in the controller repository.

```shell
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace
```

The container image is `ghcr.io/metio/stageset-controller`; the chart pins the
tag to its own `appVersion` by default.

Every setting referenced across these docs — HA replicas, the rollback store,
webhook mode, NetworkPolicy, the ServiceMonitor, and the rest — is a Helm value.
The [chart's README and `values.yaml`](https://github.com/metio/helm-charts/tree/main/charts/stageset-controller)
document the full, current list.

### What the chart installs

- The **controller `Deployment`**, its `ServiceAccount`, and the cluster RBAC it
  needs (a `ClusterRole` + `ClusterRoleBinding`, plus a namespaced leader-election
  `Role`/`RoleBinding`).
- The **CRDs** — `StageSet` and `StageInventory`.
- The **validating admission webhook** (`ValidatingWebhookConfiguration` + a
  webhook `Service`).
- A **metrics `Service`** (and an opt-in `ServiceMonitor`).
- The **Flagger gate `Service`** for the read-only stage-gate endpoint.
- Opt-in extras: `NetworkPolicy`, `PodDisruptionBudget`,
  `HorizontalPodAutoscaler`, a rollback-store `PersistentVolumeClaim`, and a
  managed `Namespace`.

### About the CRDs

The CRDs ship inside the chart's regular templates (not Helm's special `crds/`
directory), so a `helm upgrade` applies schema changes like any other resource.
This is governed by `crds.create` (default `true`). The CRDs carry
`helm.sh/resource-policy: keep`, so a `helm uninstall` leaves them — and your
StageSets — in place; remove them by hand if you really mean to.

If you manage CRDs out of band, the raw definitions are also published in the
controller repository under `config/crd/` and can be applied with
`kubectl apply --server-side -f`.
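
If you take the out-of-band route, a sketch of the matching values is to switch the chart's CRD templates off. This assumes you have already applied the definitions from `config/crd/` yourself before installing or upgrading the chart.

```yaml
# values.yaml: assumes the CRDs are applied separately
crds:
  create: false   # the chart no longer renders the StageSet and StageInventory CRDs
```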

## Verify

```shell
kubectl -n stageset-system get deploy stageset-controller
kubectl get crd stagesets.stages.metio.wtf stageinventories.stages.metio.wtf
```

Once the controller is `Available`, create your first
[StageSet](/usage/stages-and-sources/).


---

# Operations

Source: https://stageset.projects.metio.wtf/installation/operations/


## Metrics

The controller registers custom metrics on the controller-runtime registry, served
on `--metrics-bind-address` (`:8080`) alongside the standard
`controller_runtime_*` and `workqueue_*` series. Enable scraping with the chart's
opt-in `ServiceMonitor` (`metrics.serviceMonitor.enabled`):

```yaml
# values.yaml
metrics:
  serviceMonitor:
    enabled: true        # needs the Prometheus operator CRDs
```

| Metric | Type | Labels | Meaning |
|---|---|---|---|
| `stageset_reconcile_total` | counter | `namespace`, `name`, `reason` | Reconciles, by terminal Ready reason. |
| `stageset_stage_applied_total` | counter | `namespace`, `name`, `stage` | Stages applied and verified. |
| `stageset_drift_corrected_total` | counter | `namespace`, `name`, `stage` | Out-of-band drift re-asserted on a steady-state reconcile. |
| `stageset_update_deferred_total` | counter | `namespace`, `name` | Rollouts deferred because the update windows did not permit a rollout at that time. |
| `stageset_webhook_cert_renewal_failures_total` | counter | _(none)_ | Failed self-signed webhook cert renewals. |
| `stageset_stage_ready` | gauge | `namespace`, `stageset`, `stage` | `1` when a stage is Ready, else `0` — for metric-based [progressive delivery](/tutorials/progressive-delivery/#argo-rollouts). |

## Alerts

The chart ships an opt-in `PrometheusRule` with a starter alert set, gated on
`metrics.prometheusRule.enabled` (requires the
[Prometheus operator](https://prometheus-operator.dev/) CRDs). It covers the
custom `stageset_*` metrics plus controller-runtime signals:

| Alert | Fires on | Severity |
|---|---|---|
| `StageSetReconcileErrorsHigh` | per-StageSet Ready=False rate (excludes the healthy `Succeeded`/`Suspended` reasons) | warning |
| `StageSetControllerWorkqueueDepthHigh` | the reconcile queue not draining | warning |
| `StageSetReconcileLatencyHigh` | reconcile p99 latency over threshold | warning |
| `StageSetControllerPodDown` | a controller pod NotReady | critical |
| `StageSetWebhookCertRenewalFailing` | self-signed cert rotation failing | critical |

Every threshold is a knob under `metrics.prometheusRule.thresholds`, and
`extraAlertLabels` is merged onto every rendered alert so all stageset alerts can
route through one Alertmanager receiver. Each alert carries a `runbook_url`
annotation pointing at the matching [runbook](/runbooks/) page on this site
(`metrics.prometheusRule.runbookBaseURL`); the reconcile-errors alert templates the
URL on `$labels.reason`. Append your own rules under
`metrics.prometheusRule.extraRules`, and silence a built-in alert by raising its
threshold rather than forking the chart.
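
A hedged sketch of enabling the rule set with one extra label and one custom rule follows. The placement of `extraAlertLabels` under `metrics.prometheusRule`, the label value, the alert name, and the assumption that `extraRules` takes standard Prometheus rule entries are all guesses to check against the chart's `values.yaml`; the metric and its labels come from the Metrics table.

```yaml
# values.yaml: a sketch, not the chart's documented schema
metrics:
  prometheusRule:
    enabled: true                    # needs the Prometheus operator CRDs
    extraAlertLabels:                # assumed location of this value
      team: platform                 # hypothetical routing label
    extraRules:                      # assumed to accept standard rule entries
      - alert: StageSetUpdatesDeferred      # hypothetical alert name
        expr: sum by (namespace, name) (increase(stageset_update_deferred_total[1h])) > 0
        for: 6h
        labels:
          severity: info
        annotations:
          summary: A StageSet rollout has been held by its update windows for several hours
```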

## Events

The controller emits Kubernetes Events on every Ready-condition transition, so
`kubectl describe stageset <name>` and [Flux](https://fluxcd.io/)'s
`notification-controller` (via an `Alert` targeting `kind: StageSet`) both
surface what happened. Normal events include `Succeeded`, `UpdateDeferred`,
`MigrationStarted`, and `MigrationCompleted`; warnings include `StageFailed`,
`DriftCorrected`, `RolledBack`, `MigrationFailed`, `OnFailureAction`, and
`RollbackStoreFailed`.
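
As a sketch of the `notification-controller` route, an `Alert` selecting a StageSet might look like this. It assumes the `notification.toolkit.fluxcd.io/v1beta3` API, a `Provider` named `slack` that already exists in the same namespace (hypothetical), and a notification-controller version that accepts `StageSet` as an event-source kind.

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3   # assumed API version
kind: Alert
metadata:
  name: payments-stageset
  namespace: payments
spec:
  providerRef:
    name: slack              # hypothetical Provider, created separately
  eventSeverity: error       # error-level events only; info also forwards normal events
  eventSources:
    - kind: StageSet
      name: payments
```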

## Runbooks

Every actionable Ready-condition reason has a [runbook](/runbooks/) covering the
symptom, cause, diagnosis, and remediation. Set `--runbook-base-url` (the chart's
`controller.runbookBaseURL`, which defaults to this docs site) to a published copy
of those pages and the controller appends `(runbook: <base>/<reason>/)` to the
Ready message (the reason lower-cased into a path segment), so a `kubectl describe`
links straight to the fix. Healthy reasons (`Succeeded`, `Suspended`) get no link.

```yaml
# values.yaml — point at your own mirror, or set "" to drop the links
controller:
  runbookBaseURL: https://runbooks.internal/stageset
```

For example, a `StageFailed` StageSet then shows:

```text
Message:  stage "application" failed: … (runbook: https://runbooks.internal/stageset/stagefailed/)
```

## Forcing a reconcile

The controller reconciles on its `spec.interval`, on source changes, and on
demand. To trigger an out-of-band run, stamp the standard annotation — which is
what `flux reconcile` and [`stagesetctl reconcile`](/cli/reconcile/) do for you:

```shell
kubectl annotate stageset my-app \
  reconcile.fluxcd.io/requestedAt="$(date -u +%FT%TZ)" --overwrite
```

The handled token is recorded in `status.lastHandledReconcileAt`.

## Drift correction

On a steady-state reconcile the controller re-asserts the desired state, healing
out-of-band changes to managed objects. Each correction emits a `DriftCorrected`
event and increments `stageset_drift_corrected_total`. Tighten the cadence with
`spec.driftDetectionInterval` when you need faster healing than `spec.interval`.
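
A minimal sketch of a StageSet that heals drift faster than it runs full reconciles is shown below. The two durations are illustrative, and the duration syntax for `spec.driftDetectionInterval` is assumed to match the `10m` style used by `--default-interval`.

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: my-app
  namespace: default
spec:
  interval: 10m                  # regular reconcile cadence
  driftDetectionInterval: 1m     # assumed duration format; tighter drift healing
  stages:
    - name: app
      sourceRef:
        name: my-app
```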


---

# Production

Source: https://stageset.projects.metio.wtf/installation/production/


## High availability

The controller supports leader-elected HA. Enable leader election and run more
than one replica; only the lease holder reconciles, while every replica answers
admission webhook calls (admission must stay available even on non-leaders).

- Leader election is toggled with `--leader-elect`. The binary defaults it to
  `false`, but the **Helm chart enables it by default** (`controller.leaderElect:
  true`), so a default install is already lease-guarded even at one replica.
- The lease is named `stageset-controller.stages.metio.wtf` and lives in the
  controller's namespace. It uses controller-runtime's default timing (~15 s
  lease duration). The lease is **not** released eagerly on shutdown, so after a
  rolling update the new leader takes over only when the old lease expires. Budget
  roughly one lease duration (~15 s) of reconcile pause on restart; admission and
  the gate endpoint are unaffected.
- Scaling: when the chart's `replicas.max` exceeds `replicas.min` it renders a
  `HorizontalPodAutoscaler` (CPU target 80%) and a `PodDisruptionBudget`
  (`minAvailable: 1`). At the default 1/1 it sets neither and leaves
  `spec.replicas` unmanaged.

The controller watches every namespace by default. Multi-tenancy is enforced per
`StageSet` through impersonation (see below). You can additionally scope the
controller to a namespace set with `controller.watchNamespaces` — one controller
instance per tenant-group — and run it under `cluster-admin` for single-tenant
clusters; both are covered in
[multi-cluster and tenancy](/usage/multi-cluster/).

## Hardening

Each option below is shown as the Helm values that configure it. Several are
already the chart's defaults, shown so you can see what is applied and override
it for a stricter policy.

### Tenant impersonation

The controller never applies your manifests with its own identity. Every cluster
operation for a `StageSet` — building, applying, pruning, running actions — is
performed impersonating the `StageSet`'s `spec.serviceAccountName` (the chart
grants the controller `impersonate`, not write access). A `StageSet` can only do
what its tenant SA permits; an over-broad or missing SA fails closed.

This one lives on the `StageSet`, not in the chart — give every production
`StageSet` a scoped `ServiceAccount`:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata: { name: payments, namespace: payments }
spec:
  serviceAccountName: payments-deployer   # scoped to exactly this release's needs
  # …
```
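
The scoped `ServiceAccount` itself is ordinary Kubernetes RBAC. The sketch below is one hypothetical shape for `payments-deployer`; the rule list is an assumption and should mirror whatever kinds the release's stages actually apply.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-deployer
  namespace: payments
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-deployer
  namespace: payments
rules:
  # hypothetical rules: list only the kinds this release applies
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]          # e.g. for a migration Job action
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-deployer
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: payments-deployer
    namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: payments-deployer
```

A namespaced `Role` cannot grant access to cluster-scoped objects, so a stage that applies CRDs or namespaces would need a `ClusterRole` and `ClusterRoleBinding` for the same `ServiceAccount`.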

### Pod security context

The chart runs a non-root, read-only-root-filesystem pod with all capabilities
dropped, on a `gcr.io/distroless/static:nonroot` image (no shell or package
manager). These are the rendered defaults:

```yaml
podSecurityContext:
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  runAsGroup: 65532
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: [ALL]
  seccompProfile:
    type: RuntimeDefault
```

### Resource limits

Requests equal limits, so the pod is fully constrained:

```yaml
resources:
  cpu: 50m
  memory: 256Mi
  ephemeralStorage: 32Mi   # /tmp and the self-signed cert dir are emptyDirs
```

### Pod-Security Standards namespace

Have the chart create the install namespace with restricted PSS labels:

```yaml
namespace:
  create: true
  pssLevel: restricted     # or: baseline / privileged
```

### Network policy

The gate endpoint is **unauthenticated** (read-only
`GET /gate/{namespace}/{stageset}/{stage}`). Turn on the ingress-only NetworkPolicy
to fence it — and the webhook/metrics ports — to only the peers that need them:

```yaml
networkPolicy:
  enabled: true            # admits the webhook (9443), metrics (8080), gate (8082)
```

The policy is **ingress-only**, so it does not restrict egress — the controller can
still fetch stage artifacts over HTTP from source-controller (an `ExternalArtifact`
or a `GitRepository`/`OCIRepository`/`Bucket` is served from the same artifact
endpoint). If your cluster default-denies egress, add an egress allowance to
source-controller (and DNS) so those fetches succeed.
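
Under a default-deny egress policy, an allowance along these lines is one possible sketch. The controller's pod label, the `flux-system` namespace, source-controller's `app: source-controller` label and port `9090`, and DNS living in `kube-system` are assumptions about a typical install; verify each against your cluster. The controller also needs to reach the Kubernetes API server (and S3, if used), which this sketch does not cover.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: stageset-controller-egress        # hypothetical name
  namespace: stageset-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: stageset-controller   # assumed chart label
  policyTypes: [Egress]
  egress:
    - to:                                 # artifact fetches from source-controller
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: flux-system
          podSelector:
            matchLabels:
              app: source-controller      # assumed pod label
      ports:
        - protocol: TCP
          port: 9090                      # assumed artifact server port
    - to:                                 # cluster DNS
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```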

### Admission webhook TLS

`webhook.certMode` chooses how the webhook serving certificate is obtained:

```yaml
webhook:
  certMode: cert-manager   # cert-manager issues + rotates the cert (requires cert-manager)
  # certMode: self-signed  # chart default: in-pod CA + serving cert, rotated at
  #                          validity/3, with no cert-manager dependency
```

## Reference setups

Two HA shapes — on-prem with shared RWX storage, and AWS/EKS with S3 — over the
same backbone: a leader-elected pair (or trio), a rollback store reachable from
whichever pod holds the lease, cert-manager for the webhook, a `NetworkPolicy`
fencing the unauthenticated gate, and a `ServiceMonitor` if you run Prometheus.

Both run two replicas for [HA](#high-availability) (`replicas.max` above
`replicas.min` also renders a PDB and an HPA) and set
`webhook.certMode: cert-manager`, so [cert-manager](https://cert-manager.io/) must
be installed in the cluster.

### On-prem (RWX storage)

The rollback store gives bit-exact rollbacks that outlive producer GC. With HA
replicas it must be reachable from whichever pod holds the lease, so use a
`ReadWriteMany` PVC on your on-prem storage class — every replica mounts the same
volume.

```yaml
# values-onprem.yaml
replicas:
  min: 2                 # leader-elected HA; the non-leader still serves admission
  max: 3                 # > min renders an HPA (CPU 80%) and a PodDisruptionBudget

controller:
  leaderElect: true

rollbackStore:
  backend: pvc
  pvc:
    accessModes: [ReadWriteMany]
    storageClass: nfs-client     # your RWX class (NFS, CephFS, …)
    size: 10Gi

webhook:
  certMode: cert-manager         # requires cert-manager in the cluster

networkPolicy:
  enabled: true                  # fences the unauthenticated gate endpoint

metrics:
  serviceMonitor:
    enabled: true
```

```shell
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace \
  -f values-onprem.yaml
```

### AWS / EKS (S3)

On EKS, back the rollback store with S3 and let the controller assume an IAM role
through [IRSA](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html)
— no static keys. Annotate the controller's ServiceAccount with the role ARN and
leave the S3 credentials empty; the store's minio-go client picks the role up from
the pod's web-identity token.

```yaml
# values-eks.yaml
replicas:
  min: 2
  max: 3

controller:
  leaderElect: true

serviceAccount:
  annotations:
    # an IAM role granting s3:GetObject/PutObject/ListBucket/DeleteObject on the bucket
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/stageset-controller

rollbackStore:
  backend: s3
  s3:
    endpoint: s3.eu-west-1.amazonaws.com
    bucket: my-org-stageset-rollback
    region: eu-west-1
    # no existingSecret → credentials come from the IRSA role above

webhook:
  certMode: cert-manager

networkPolicy:
  enabled: true

metrics:
  serviceMonitor:
    enabled: true
```

```shell
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace \
  -f values-eks.yaml
```

### Alongside the other Flux controllers

`stageset-controller` is a [Flux](https://fluxcd.io/) citizen and needs no special
wiring to coexist with `source-controller`, `kustomize-controller`,
`helm-controller`, and `notification-controller`. It reads `ExternalArtifact` (and
the standard `GitRepository`, `OCIRepository`, and `Bucket` sources) from
`source-controller`, and `notification-controller` routes its events through an
`Alert` that targets `kind: StageSet` — no Provider/Alert plumbing of its own.
Install it in its own namespace (e.g. `stageset-system`) next to `flux-system`;
the only cluster-scoped pieces are its CRDs, `ClusterRole`, and webhook
configuration.

### Alongside JaaS

[JaaS](https://jaas.projects.metio.wtf/) renders Jsonnet and publishes the result
as an `ExternalArtifact`, which is what a `StageSet` stage consumes — so the two
compose directly. Reference the artifact by name, or name the producing
`JsonnetSnippet` and let `stageset-controller` resolve it (see
[producer-aware sources](/usage/producer-aware-sources/)). They can share a
cluster and namespace or stay separate; both are reconciled by Flux and both apply
under per-tenant impersonation, so the security model is consistent end to end.

## Settings you can tune

The chart wires the controller; you set Helm values. The set worth thinking about
is below — each row is the value, its default, and when you'd change it.
Everything else the chart configures for you (see
[what the chart manages](#what-the-chart-manages)).

| Helm value | Default | When to change |
|---|---|---|
| `replicas.min` / `replicas.max` | `1` / `1` | Raise both to ≥ 2 for HA; set `max > min` to also render an HPA + PDB. |
| `controller.leaderElect` | `true` | Leave on — harmless at one replica, required for HA. |
| `controller.defaultInterval` | `10m` | The reconcile cadence StageSets inherit when they omit `spec.interval`. Lower for faster drift correction cluster-wide. |
| `controller.inventoryMode` | `hybrid` | `applyset` for ApplySet-native tooling; `entries` to drop the ApplySet labels. |
| `controller.inventoryShardCap` | `5000` | Lower only if a stage applies a huge object count and you want smaller inventory objects. |
| `controller.allowedActionHosts` | `[]` | Add host globs your `http` [actions](/usage/actions/) must reach (loopback/link-local are always denied). |
| `controller.noCrossNamespaceRefs` | `false` | `true` to hard-isolate namespaces (deny cross-namespace `sourceRef`/`dependsOn`). |
| `controller.watchNamespaces` | `[]` | Restrict the controller to a namespace list (cache + RBAC pivot to per-namespace bindings); empty watches cluster-wide. See [tenancy](/usage/multi-cluster/#scoping-the-controller-to-a-namespace-set). |
| `rbac.clusterAdmin` | `false` | `true` on **single-tenant** clusters to bind the controller SA to `cluster-admin` so StageSets apply without `serviceAccountName`. See [single-tenant](/usage/multi-cluster/#single-tenant-cluster-admin). |
| `controller.runbookBaseURL` | the docs site | Point at a fork/mirror, or empty to drop the runbook links from Ready messages. |
| `webhook.certMode` | `self-signed` | `cert-manager` if you run cert-manager — see [reference setups](#reference-setups). |
| `gate.enabled` | `true` | Leave on for [progressive delivery](/tutorials/progressive-delivery/) (the Flagger/Argo gate); set `false` to drop the gate Service and endpoint. |
| `rollbackStore.backend` | `none` | `pvc` (RWX) or `s3` to enable [`spec.rollbackOnFailure`](/usage/rollback/); the two are mutually exclusive. |
| `rollbackStore.s3.sse` | `s3` | At-rest encryption for the S3 store (it holds rendered Secret data): `s3` (SSE-S3), `kms` (+`sseKmsKeyId`), or `none`. See [encryption at rest](/usage/rollback/#encryption-at-rest). |
| `networkPolicy.enabled` | `false` | `true` to fence the controller and the unauthenticated gate. |
| `metrics.serviceMonitor.enabled` | `false` | `true` if you scrape with the Prometheus operator. |
| `metrics.prometheusRule.enabled` | `false` | `true` for the bundled [alerts](/installation/operations/#alerts). |
| `serviceAccount.annotations` | `{}` | An IRSA role ARN on EKS so the S3 store uses an IAM role. |
| `namespace.create` | `false` | `true` to have the chart create the install namespace with Pod-Security labels. |
| `resources` | requests = limits | Raise for very large or very busy releases. |

Every option is set the same way — in your values file, applied with
`helm upgrade --install … -f values.yaml`. The [reference setups](#reference-setups)
above are complete, copy-pasteable examples.

## What the chart manages

You do **not** configure these — the chart wires them so the controller behaves
correctly out of the box:

- **Leader election and HA plumbing** — the lease, and the PDB/HPA when
  `replicas.max > replicas.min`.
- **The admission webhook** — the server, its Service, the
  `ValidatingWebhookConfiguration`, and the certificate (cert-manager `Certificate`
  or the in-pod self-signed renewer, per `webhook.certMode`).
- **Endpoints** — metrics, health probes, and the gate, on their Services.
- **RBAC** — the ClusterRole/bindings the controller needs, including the
  `impersonate` verb (it never applies as itself).
- **A hardened pod** — non-root, read-only root filesystem, dropped capabilities,
  seccomp `RuntimeDefault` (see [pod security context](#pod-security-context)).
- **Per-tenant impersonation** — every apply runs as the StageSet's
  `spec.serviceAccountName`.

## Controller flags

The chart sets the controller's command-line flags from your Helm values and its
own defaults — you never pass them directly. For the exhaustive per-flag list with
defaults, see the [Configuration reference](/installation/configuration/), which
also notes which Helm value drives each one.


---

# Tutorials

Source: https://stageset.projects.metio.wtf/tutorials/


End-to-end walkthroughs that stitch several pieces together. Where the
[usage](/usage/) pages each cover one feature in isolation, these follow a whole
task from start to finish.


---

# From Jsonnet to a gated rollout

Source: https://stageset.projects.metio.wtf/tutorials/jsonnet-to-rollout/


This tutorial follows a complete delivery: write [Kubernetes](https://kubernetes.io/docs/)
manifests in [Jsonnet](https://jsonnet.org/) and publish the source through
[Flux](https://fluxcd.io/); [JaaS](https://jaas.projects.metio.wtf/) renders it into a
Flux `ExternalArtifact`, and a StageSet rolls it out with a readiness gate.

The chain is:

```text
Jsonnet in Git/OCI/Bucket  →  JaaS (JsonnetSnippet)  →  ExternalArtifact  →  StageSet
```

This tutorial renders *Jsonnet*, so it goes through JaaS: JaaS turns the Jsonnet
into an `ExternalArtifact` the stage consumes. (If your manifests were already plain
YAML, a stage could read a `GitRepository`/`OCIRepository`/`Bucket` directly — see
[Stage sources](/tutorials/flux-sources/). The renderer is here because the input is
Jsonnet, not because StageSet can't read Git.)

## Prerequisites

- Flux installed (with the `ExternalArtifact` API — Flux ≥ v2.7.0).
- [JaaS](https://jaas.projects.metio.wtf/) installed in operator mode.
- StageSet installed (see [Installation](/installation/kubernetes/)).
- An `apps` namespace, and a `web-deployer` `ServiceAccount` in it whose RBAC can
  apply the workload (the StageSet impersonates it):

  ```shell
  kubectl create namespace apps
  kubectl -n apps create serviceaccount web-deployer
  # bind web-deployer to a Role/ClusterRole that can manage Deployments and
  # Services in the apps namespace — see /usage/multi-cluster/ for the tenancy model
  ```

## 1. Write the manifests in Jsonnet

A small web app, parameterized as a Jsonnet top-level function so the same source
renders for any environment. Commit this as `jsonnet/main.jsonnet` in a Git repo:

```jsonnet
// jsonnet/main.jsonnet
function(name='web', image='registry.internal/web:latest', replicas='2') {
  apiVersion: 'v1',
  kind: 'List',
  items: [
    {
      apiVersion: 'apps/v1',
      kind: 'Deployment',
      metadata: { name: name },
      spec: {
        replicas: std.parseInt(replicas),
        selector: { matchLabels: { app: name } },
        template: {
          metadata: { labels: { app: name } },
          spec: { containers: [{ name: name, image: image }] },
        },
      },
    },
    {
      apiVersion: 'v1',
      kind: 'Service',
      metadata: { name: name },
      spec: { selector: { app: name }, ports: [{ port: 80, targetPort: 8080 }] },
    },
  ],
}
```

Rendering a `kind: List` keeps several resources in one document — both the
kustomize build the controller runs and `kubectl` flatten it transparently.

## 2. Publish the source through Flux

Point a Flux `GitRepository` at the repo so the cluster has the Jsonnet:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: web-manifests
  namespace: apps
spec:
  interval: 1m
  url: https://github.com/acme/web-manifests
  ref:
    branch: main
```

Apply it and wait for the source to sync:

```shell
kubectl apply -f gitrepository.yaml
kubectl -n apps wait --for=condition=Ready gitrepository/web-manifests
```

## 3. Render with JaaS

A `JsonnetSnippet` reads the Jsonnet from that source, passes the parameters as
top-level arguments, and publishes the rendered result as an `ExternalArtifact`:

```yaml
apiVersion: jaas.metio.wtf/v1
kind: JsonnetSnippet
metadata:
  name: web
  namespace: apps
spec:
  sourceRef:
    kind: GitRepository
    name: web-manifests
    path: ./jsonnet
  entryFile: main.jsonnet
  tlas:                            # top-level args → the function() parameters
    name: ["web"]
    image: ["registry.internal/web:2.1.0"]
    replicas: ["3"]
```

Apply it; JaaS then publishes an `ExternalArtifact` named `web` in the `apps`
namespace. Confirm it went Ready:

```shell
kubectl apply -f jsonnetsnippet.yaml
kubectl -n apps get externalartifact web
```

## 4. Roll it out with StageSet

Reference the `JsonnetSnippet` as the stage source — StageSet resolves the
producer to its `ExternalArtifact` — and gate the stage on the Deployment becoming
available:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: web
  namespace: apps
spec:
  serviceAccountName: web-deployer      # applies are impersonated as this SA
  stages:
    - name: web
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: web
      readyChecks:
        checks:
          - apiVersion: apps/v1
            kind: Deployment
            name: web
```

Apply it, preview the change before it lands, then watch it roll out:

```shell
kubectl apply -f stageset.yaml
stagesetctl diff web -n apps          # preview against live cluster state
stagesetctl get  web -n apps          # per-stage progress
```

## 5. Ship a change

Edit `jsonnet/main.jsonnet` (or bump the `image` TLA on the snippet) and commit.
Flux pulls the new commit, JaaS re-renders and republishes the `ExternalArtifact`,
and StageSet — watching the producer — reconciles the new revision through the
same gate. No StageSet edit required.

### No labels or annotations needed

You do **not** annotate or label anything to make this chain fire. The linkage is
the `sourceRef` itself: the controller watches the source *kinds* (`ExternalArtifact`,
`GitRepository`, `OCIRepository`, `Bucket`, and producers like `JsonnetSnippet`) and,
when one changes, maps it back to every StageSet whose `sourceRef` points at it — then
reconciles those. JaaS works the same way for a snippet's own `sourceRef` and
library references. Discovery is automatic; you only declare the references.
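
The mapping described above can be pictured as a reverse index from a source to the StageSets that reference it. The following is a minimal sketch, not the controller's code: the dictionary shapes, the `build_index` and `stagesets_to_reconcile` helpers, the `api` StageSet, and the optional `namespace` key on a `sourceRef` are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(stagesets):
    """Map each referenced source (kind, namespace, name) to the StageSets using it."""
    index = defaultdict(set)
    for stageset in stagesets:
        for stage in stageset["stages"]:
            ref = stage["sourceRef"]
            key = (
                ref.get("kind", "ExternalArtifact"),          # documented default kind
                ref.get("namespace", stageset["namespace"]),  # assumption: same namespace unless stated
                ref["name"],
            )
            index[key].add((stageset["namespace"], stageset["name"]))
    return index


def stagesets_to_reconcile(index, kind, namespace, name):
    """Return the StageSets to enqueue when the given source object changes."""
    return sorted(index.get((kind, namespace, name), ()))


stagesets = [
    {"namespace": "apps", "name": "web",
     "stages": [{"name": "web", "sourceRef": {"kind": "JsonnetSnippet", "name": "web"}}]},
    {"namespace": "apps", "name": "api",  # hypothetical second StageSet
     "stages": [{"name": "api", "sourceRef": {"name": "api"}}]},
]
index = build_index(stagesets)
print(stagesets_to_reconcile(index, "JsonnetSnippet", "apps", "web"))
```

A change to an object that no `sourceRef` points at maps to an empty list, so nothing is reconciled.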

## Versioning the rollout

To gate one-time [migrations](/usage/versioned-migrations/) on a release boundary,
declare the version. The simplest is to pin it on the StageSet, bumped alongside the
image:

```yaml
spec:
  version:
    value: "2.1.0"
  migrations:
    - name: backfill-2-0
      to: "2.0.0"               # runs once when the deployed version crosses 2.0.0
      stage: web
      actions:
        - name: backfill
          job:
            sourceRef:
              name: web-migrations
  stages:
    - name: web
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: web
```
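
To make "crosses a release boundary" concrete, here is a small sketch of one plausible reading: a migration with `to: "2.0.0"` fires when the previously deployed version is below the boundary and the new version is at or above it. The exact comparison the controller uses is not stated on this page, so the inequality, the plain `MAJOR.MINOR.PATCH` parsing, and both helper names are assumptions.

```python
def parse(version):
    """Parse a plain MAJOR.MINOR.PATCH string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def crosses(previous, current, boundary):
    """True when moving from `previous` to `current` passes `boundary`.

    Assumption: the boundary counts as crossed when previous < boundary <= current.
    """
    return parse(previous) < parse(boundary) <= parse(current)


print(crosses("1.9.3", "2.1.0", "2.0.0"))  # upgrade over the boundary
print(crosses("2.0.0", "2.1.0", "2.0.0"))  # already past the boundary
```

Under this reading the migration runs on the first upgrade and is skipped on every later one. See [versioned migrations](/usage/versioned-migrations/) for the authoritative rules.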

### Let the version travel with the rendered manifests

Pinning works, but the cleaner pattern is to let the version ride *inside* the
manifests the snippet renders — so a single value flows from your CI all the way to
the rollout gate. Feed the version into the snippet and stamp it onto the standard
`app.kubernetes.io/version` label (and the image tag, from the same value):

```jsonnet
// web.jsonnet
local version = std.extVar('version');   // set through the snippet's externalVariables, e.g. by your CI
{
  apiVersion: 'apps/v1',
  kind: 'Deployment',
  metadata: {
    name: 'web',
    labels: { 'app.kubernetes.io/version': version },   // ← the version, in the manifest
  },
  spec: {
    template: {
      metadata: { labels: { 'app.kubernetes.io/version': version } },
      spec: { containers: [{ name: 'web', image: 'registry.example/web:' + version }] },
    },
  },
}
```

Then point `version.fromObject` at that object and drop the inline `value` — the
controller reads the label off the rendered `Deployment`:

```yaml
spec:
  version:
    fromObject:
      stage: web
      kind: Deployment
      name: web
      # defaults to the app.kubernetes.io/version label
  migrations:
    - name: backfill-2-0
      to: "2.0.0"
      stage: web
      actions:
        - name: backfill
          job:
            sourceRef:
              name: web-migrations
  stages:
    - name: web
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: web
```

Now the version has exactly one source of truth — the value your pipeline feeds the
snippet — and it shows up in the image tag, the version label, *and* the migration
gate together. The same `fromObject` works for a `GitRepository`/`OCIRepository`
source too; only a source that ships a dedicated file wants
[`version.fromArtifact`](/usage/versioned-migrations/#from-a-file-in-the-artifact--versionfromartifact)
instead. See [versioned migrations](/usage/versioned-migrations/) for all three.
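
As an illustration of what `version.fromObject` looks up, the sketch below reads the `app.kubernetes.io/version` label off a rendered object held as a Python dictionary. It is a simplified model under assumptions: the `version_from_object` helper and the in-memory manifest list are hypothetical, and the controller's real lookup may differ.

```python
VERSION_LABEL = "app.kubernetes.io/version"


def version_from_object(manifests, kind, name, label=VERSION_LABEL):
    """Return the version label of the first object matching kind and name, else None."""
    for obj in manifests:
        metadata = obj.get("metadata", {})
        if obj.get("kind") == kind and metadata.get("name") == name:
            return metadata.get("labels", {}).get(label)
    return None


rendered = [
    {"kind": "Service", "metadata": {"name": "web"}},
    {"kind": "Deployment",
     "metadata": {"name": "web", "labels": {VERSION_LABEL: "2.1.0"}}},
]
print(version_from_object(rendered, "Deployment", "web"))
```

An object without the label yields `None` in this sketch; how the controller reports a missing label is not covered here.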

## Next

From here, add more [stages](/usage/stages-and-sources/), pre/post
[actions](/usage/actions/), or [update windows](/usage/update-windows/) to turn
this single rollout into a gated, multi-stage release. To parameterize per
environment, see [Parameters](/tutorials/parameters/).


---

# Parameterizing a rollout

Source: https://stageset.projects.metio.wtf/tutorials/parameters/


A rollout takes parameters at two distinct layers, which serve different purposes:

- **Render-time parameters (JaaS).** Change *what gets rendered*. The Jsonnet
  computes its output from top-level arguments (`tlas`) and external variables
  (`externalVariables`). Different values produce a different `ExternalArtifact`.
- **Delivery-time parameters (StageSet `postBuild`).** Inject values *into
  already-rendered manifests*, per stage, by string substitution — the same
  mechanism Flux's `kustomize-controller` uses.

Use render-time parameters for structural logic; use delivery-time parameters to
stamp environment-specific values onto a shared artifact.

## Render-time: JaaS TLAs and external variables

Top-level arguments map to a Jsonnet `function(...)`:

```jsonnet
// main.jsonnet
function(name='web', replicas='2')
  { apiVersion: 'apps/v1', kind: 'Deployment', metadata: { name: name },
    spec: { replicas: std.parseInt(replicas) /* … */ } }
```

```yaml
apiVersion: jaas.metio.wtf/v1
kind: JsonnetSnippet
metadata:
  name: web
  namespace: apps
spec:
  sourceRef: { kind: GitRepository, name: web-manifests, path: ./jsonnet }
  tlas:                          # → function(name, replicas)
    name: ["web"]
    replicas: ["3"]
  externalVariables:            # → std.extVar('environment')
    environment: "production"
```

`tlas` is a map of name → list of values (a single-element list for a scalar
argument; multiple values become a JSON array). `externalVariables` are plain
strings read with `std.extVar`.
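
The list convention for `tlas` can be sketched as follows. This is only a model of the rule stated above, not JaaS code: the `tla_value` helper and the `zones` argument are hypothetical, and how JaaS hands a multi-value argument to the Jsonnet VM is an assumption.

```python
import json


def tla_value(values):
    """A single-element list is a scalar argument; several values stay a list."""
    if len(values) == 1:
        return values[0]
    return list(values)


tlas = {"name": ["web"], "replicas": ["3"], "zones": ["eu-1", "eu-2"]}  # zones is hypothetical
resolved = {name: tla_value(values) for name, values in tlas.items()}
print(json.dumps(resolved))  # the list serializes as a JSON array
```

Note that `replicas` stays the string `"3"`, which is why the Jsonnet above converts it with `std.parseInt`.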

## Delivery-time: StageSet postBuild substitution

When the rendered manifests carry `${var}` placeholders, a stage substitutes them
at apply time — from inline values, ConfigMaps, and Secrets:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: web
  namespace: apps
spec:
  stages:
    - name: web
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: web
      postBuild:
        substitute:
          cluster_name: prod-eu
        substituteFrom:
          - kind: ConfigMap
            name: cluster-vars
          - kind: Secret
            name: cluster-secrets
            optional: true
```

A manifest field like `value: "${cluster_name}"` becomes `value: "prod-eu"` for
this stage.
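
The substitution step can be approximated in a few lines. The sketch below is not the controller's implementation: it assumes that inline `substitute` values win over `substituteFrom` values and that later `substituteFrom` entries override earlier ones (the precedence Flux documents for `kustomize-controller`), and it leaves unknown placeholders untouched as a choice of the sketch. The `region` variable is hypothetical.

```python
import re

# Matches ${name} where name is a simple identifier.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute(text, inline, sources=()):
    """Replace ${var} placeholders in text using merged variable maps."""
    values = {}
    for source in sources:   # assumption: later ConfigMaps/Secrets override earlier ones
        values.update(source)
    values.update(inline)    # assumption: inline `substitute` values take precedence
    return _PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), text)


print(substitute('value: "${cluster_name}"', {"cluster_name": "prod-eu"}))
```

Values coming from a `ConfigMap` or `Secret` would be passed as the dictionaries in `sources`.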

## Reusing one artifact across environments

The two layers combine into a common pattern: render an environment-*agnostic*
artifact once with JaaS, then have several StageSets — one per environment —
consume that same artifact and stamp their own values with `postBuild`:

```yaml
# staging
spec:
  stages:
    - name: web
      sourceRef: { apiVersion: jaas.metio.wtf/v1, kind: JsonnetSnippet, name: web }
      postBuild:
        substituteFrom:
          - { kind: ConfigMap, name: staging-vars }
---
# production (same artifact, different values)
spec:
  stages:
    - name: web
      sourceRef: { apiVersion: jaas.metio.wtf/v1, kind: JsonnetSnippet, name: web }
      postBuild:
        substituteFrom:
          - { kind: ConfigMap, name: production-vars }
```

One render, many environments — each StageSet bounded by its own
[ServiceAccount](/usage/multi-cluster/) and gated by its own
[actions](/usage/actions/) and [update windows](/usage/update-windows/).


---

# Progressive delivery

Source: https://stageset.projects.metio.wtf/tutorials/progressive-delivery/


`StageSet` integrates with two progressive-delivery controllers:
[Flagger](https://flagger.app/) and
[Argo Rollouts](https://argoproj.github.io/argo-rollouts/). The controller exposes
a read-only gate endpoint and a readiness gauge so either one can hold a promotion
until a `StageSet` stage is healthy; ready checks let a stage wait on a Rollout in
return. Pick the section for your controller below — see also
[StageSet vs Argo Rollouts](/comparisons/argo-rollouts/).

## The gate contract

The gate endpoint backs the Flagger integration and the Argo Rollouts JSON-metric
option.

```text
GET /gate/{namespace}/{stageset}/{stage}
  200  — the stage is Ready at the currently pinned revision
  403  — the stage is not Ready (or not found / not gateable)
```

It is served on `--gate-bind-address` (default `:8082`) and exposed by the chart's
`stageset-controller-gate` Service (`gate.enabled`, on by default). The endpoint is
**unauthenticated and read-only**, so fence it with a `NetworkPolicy`
([production](/installation/production/#network-policy)) to admit only your
delivery controller.
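
For a client that is neither Flagger nor Argo Rollouts, the contract above reduces to "only `200` means go". A minimal sketch using the Python standard library follows; the helper names and the base URL passed by the caller are assumptions, and a connection error is deliberately left to propagate rather than being treated as "not Ready".

```python
import urllib.error
import urllib.request


def gate_path(namespace, stageset, stage):
    """Build the documented gate path for one stage."""
    return f"/gate/{namespace}/{stageset}/{stage}"


def promotion_allowed(status_code):
    """Only 200 means the stage is Ready; 403 or anything else holds the promotion."""
    return status_code == 200


def stage_ready(base_url, namespace, stageset, stage, timeout=5.0):
    """Query the gate once and translate the HTTP status into a boolean."""
    url = base_url.rstrip("/") + gate_path(namespace, stageset, stage)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return promotion_allowed(response.status)
    except urllib.error.HTTPError as err:  # urllib raises on 4xx/5xx, including the 403
        return promotion_allowed(err.code)


print(gate_path("apps", "web", "web"))
```

Inside the cluster, `base_url` would be the gate Service, for example `http://stageset-controller-gate.stageset-system:8082` as used in the examples on this page.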

## Flagger

Add a `confirm-promotion` (or `confirm-rollout`) webhook to a Flagger `Canary`
pointing at the gate. Flagger blocks the promotion until the gate returns `200`:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web
  namespace: apps
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  analysis:
    interval: 1m
    threshold: 5
    stepWeight: 10
    maxWeight: 50
    webhooks:
      - name: stageset-stage-ready
        type: confirm-promotion
        # gate this canary's promotion on a StageSet stage being Ready
        url: http://stageset-controller-gate.stageset-system:8082/gate/apps/web/web
```

This is independent of the Flagger *strategy*: the same webhook gates a weighted
**canary**, an **A/B test** (header/cookie routing), or a **blue-green** promotion
— the gate only answers "is this stage Ready," and Flagger decides what to do with
that answer.

This coordinates two moving parts: Flagger shifts traffic to a new version only once
a StageSet stage that applied the supporting config (a CRD, a migration, a sibling
component) reports Ready.

## Argo Rollouts

Argo Rollouts gates on **analysis metrics** (a query that returns a value to
compare) rather than a webhook's HTTP status, so the controller meets it on its own
terms in two ways.

### Gate on the readiness gauge (recommended)

The controller exports `stageset_stage_ready{namespace,stageset,stage}` (`1` when
the stage is Ready, `0` otherwise). Argo's **Prometheus** metric provider gates on
it directly — no gate endpoint, no Job:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: stageset-stage-ready
  namespace: apps
spec:
  metrics:
    - name: stage-ready
      successCondition: result == 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: max(stageset_stage_ready{namespace="apps",stageset="web",stage="web"})
```

### Gate on the JSON endpoint

The same gate endpoint also answers JSON when asked
(`Accept: application/json`), returning `{"ready": true, …}` with a `200` so Argo's
**web** metric can parse it (Argo treats a non-2xx as an error, so readiness has to
live in the body):

```yaml
spec:
  metrics:
    - name: stage-ready
      successCondition: "result.ready == true"
      provider:
        web:
          url: http://stageset-controller-gate.stageset-system:8082/gate/apps/web/web
          headers:
            - key: Accept
              value: application/json
          jsonPath: "{$}"
```
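
Because the JSON form carries readiness in the body rather than the status code, a client has to parse the payload. The sketch below mirrors the `result.ready == true` condition in plain Python; only the `ready` field is taken from this page, and the `ready_from_body` helper is hypothetical.

```python
import json


def ready_from_body(body):
    """Return True only when the JSON body is an object with "ready": true."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # an unparseable body is treated as not Ready in this sketch
    return isinstance(payload, dict) and payload.get("ready") is True


print(ready_from_body('{"ready": true}'))
print(ready_from_body('{"ready": false}'))
```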

A **Job-based metric** (`curl -fsS …` against the gate, succeeding only on `200`)
is the fallback when the analysis has no Prometheus or web access.

## The reverse direction: gate a StageSet on a Rollout

The coordination also works the other way. Because
[ready checks](/usage/ready-checks/) accept CEL, a StageSet stage can wait on an
Argo `Rollout` finishing its own progressive rollout before the next stage runs:

```yaml
readyChecks:
  exprs:
    - apiVersion: argoproj.io/v1alpha1
      kind: Rollout
      current: "status.phase == 'Healthy'"
      inProgress: "status.phase in ['Progressing', 'Paused']"
      failed: "status.phase == 'Degraded'"
```

So StageSet can gate Argo (via the gauge/gate) and Argo's outcome can gate
StageSet (via ready checks) — pick whichever direction your release needs.
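
The three CEL expressions in the ready check above partition the Rollout's `status.phase` into outcomes. The following sketch restates that mapping in Python for clarity; the `classify` helper and the `unknown` fallback for a phase that matches none of the expressions are assumptions.

```python
def classify(phase):
    """Map an Argo Rollout status.phase onto the ready-check outcome it satisfies."""
    if phase == "Healthy":
        return "current"
    if phase in ("Progressing", "Paused"):
        return "inProgress"
    if phase == "Degraded":
        return "failed"
    return "unknown"  # assumption: no expression matched


for phase in ("Healthy", "Paused", "Degraded"):
    print(phase, "->", classify(phase))
```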


---

# Quickstart

Source: https://stageset.projects.metio.wtf/tutorials/quickstart/


This tutorial takes you from an empty cluster to one running StageSet. The path
is the shortest one — a single stage pointing directly at a Flux
`GitRepository` that already holds plain manifests. No Jsonnet, no migrations,
no optional knobs.

## Prerequisites

- A [Kubernetes](https://kubernetes.io/docs/) cluster with `kubectl` configured
  against it.
- `helm` 3.x.
- [Flux](https://fluxcd.io/) **v2.7.0 or newer**: that release introduces the
  `ExternalArtifact` CRD that a stage resolves to. See
  [Install on Kubernetes](/installation/kubernetes/#prerequisites) for the full
  prerequisites.

## Step 1 — Install the controller

```shell
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace \
  --wait --timeout 5m
```

See [Install on Kubernetes](/installation/kubernetes/) for the full list of chart
values (HA replicas, rollback store, webhook TLS mode, and so on).

Verify the controller is running:

```shell
kubectl -n stageset-system get deploy stageset-controller
# NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
# stageset-controller    1/1     1            1           30s
```

## Step 2 — Provide a source

A stage reads from a Flux source. The quickest path is a `GitRepository`
pointing at a repo that contains plain Kubernetes manifests:

```shell
cat <<EOF | kubectl apply -f -
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-app
  namespace: default
spec:
  interval: 1m
  url: https://github.com/acme/my-app-manifests
  ref:
    branch: main
EOF
```

Wait for it to sync:

```shell
kubectl -n default wait --for=condition=Ready gitrepository/my-app --timeout=2m
```

## Step 3 — Apply a StageSet

Only `spec.stages` is required. Each stage needs a `name` and a `sourceRef`.
`sourceRef.kind` defaults to `ExternalArtifact`, so a stage that reads a
`GitRepository` must set `kind` explicitly:

```shell
cat <<EOF | kubectl apply -f -
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: my-app
  namespace: default
spec:
  stages:
    - name: app
      sourceRef:
        kind: GitRepository
        name: my-app
EOF
```

## Step 4 — Confirm it reconciled

```shell
kubectl -n default get stageset my-app
# NAME     READY   AGE
# my-app   True    15s
```

If `READY` is `False`, describe the resource — the `Reason` and `Message` on the
`Ready` condition identify the problem:

```shell
kubectl -n default describe stageset my-app
```

For a richer view of per-stage progress, use the CLI:

```shell
stagesetctl get my-app -n default
```

See [CLI reference](/cli/) for all `stagesetctl` commands.

## Where to go next

- [From Jsonnet to a gated rollout](/tutorials/jsonnet-to-rollout/) — the
  flagship tutorial: render Jsonnet with [JaaS](https://jaas.projects.metio.wtf/),
  gate with readiness checks, add versioned migrations.
- [Stage sources](/tutorials/flux-sources/) — direct `GitRepository`,
  `OCIRepository`, and `Bucket` sources, plus the renderer route.
- [Usage](/usage/) — every configuration knob, one page per concern.
- [Installation](/installation/) — production-grade install: HA, rollback store,
  webhook TLS, NetworkPolicy.


---

# Stage sources — Git, OCI, Bucket

Source: https://stageset.projects.metio.wtf/tutorials/flux-sources/


A stage resolves its `sourceRef` to a [Flux](https://fluxcd.io/) artifact. You have
two routes:

```text
manifests in Git / OCI / Bucket  ──────────────────────────►  StageSet   (direct)
manifests in Git / OCI / Bucket  ──►  a renderer (JaaS)  ──►  ExternalArtifact  ──►  StageSet
```

Use the **direct** route when the source already holds ready-to-apply manifests
(the same thing Flux's `kustomize-controller` consumes). Use the **renderer** route
when you generate manifests first — e.g. evaluating Jsonnet with
[JaaS](https://jaas.projects.metio.wtf/).

Copy-pasteable recipes follow, one per source kind. For how `sourceRef`
resolution works as a concept — and the `path`, `prune`, `patches`, and
`postBuild` knobs that shape a stage — see
[stages and sources](/usage/stages-and-sources/).

## Direct: Git

Point a stage straight at a `GitRepository`:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: web-manifests
  namespace: apps
spec:
  interval: 1m
  url: https://github.com/acme/web-manifests
  ref:
    branch: main
---
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: web
  namespace: apps
spec:
  stages:
    - name: web
      sourceRef:
        kind: GitRepository
        name: web-manifests
      path: ./manifests        # build a sub-path of the repo
```

## Direct: OCI

Manifests pushed as an OCI artifact (e.g. with `flux push artifact`):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: web-manifests
  namespace: apps
spec:
  interval: 5m
  url: oci://ghcr.io/acme/web-manifests
  ref:
    tag: "2.1.0"
---
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: web
  namespace: apps
spec:
  stages:
    - name: web
      sourceRef:
        kind: OCIRepository
        name: web-manifests
```

## Direct: Bucket

Object storage works the same way:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: Bucket
metadata:
  name: web-manifests
  namespace: apps
spec:
  interval: 5m
  provider: generic
  bucketName: manifests
  endpoint: minio.storage.svc:9000
  secretRef:
    name: minio-credentials
---
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: web
  namespace: apps
spec:
  stages:
    - name: web
      sourceRef:
        kind: Bucket
        name: web-manifests
```

## Via a renderer (JaaS)

When the source holds *Jsonnet* rather than plain manifests, render it first.
[JaaS](https://jaas.projects.metio.wtf/) reads a Flux source with a `JsonnetSnippet`
and publishes the rendered result as an `ExternalArtifact`, which the stage then
consumes:

```yaml
apiVersion: jaas.metio.wtf/v1
kind: JsonnetSnippet
metadata:
  name: web
  namespace: apps
spec:
  sourceRef:
    kind: GitRepository
    name: web-manifests
    path: ./jsonnet
  entryFile: main.jsonnet
---
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: web
  namespace: apps
spec:
  stages:
    - name: web
      sourceRef:                       # resolve the producer to its ExternalArtifact
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: web
```

Shared libraries arrive the same way:
[JOI](https://github.com/metio/jsonnet-oci-images) publishes Jsonnet libraries as
single-layer OCI images, surfaced as `OCIRepository` + `JsonnetLibrary` pairs a
snippet imports:

```yaml
spec:
  libraries:
    - kind: JsonnetLibrary
      name: k8s-libsonnet
      importPath: k8s          # import 'k8s/...' in your Jsonnet
```

For small or generated snippets, skip the external source and inline the Jsonnet on
the `JsonnetSnippet` (`spec.files`). The end-to-end render-and-roll-out flow is in
[From Jsonnet to a gated rollout](/tutorials/jsonnet-to-rollout/).


---

# Usage

Source: https://stageset.projects.metio.wtf/usage/


Worked examples, smallest first. Each page covers one feature with a complete,
copy-pasteable `StageSet`; the [API reference](/api/stageset/) has the exhaustive
field-by-field detail. See the [CLI](/cli/) for `stagesetctl` commands.


---

# Actions

Source: https://stageset.projects.metio.wtf/usage/actions/


Actions are typed steps the controller runs around a stage's apply. They turn an
ordered apply into an orchestrated rollout — run a migration before the app, gate
the stage on an external check, clean up on failure.

A stage has three action hooks:

- **`pre`** — run before the manifests are built and applied. A failure aborts the
  stage with nothing applied.
- **`post`** — run after the apply is verified. The stage is `Ready` only if these
  all succeed.
- **`onFailure`** — best-effort steps run on any failure from the apply onward.
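
The ordering of the three hooks can be sketched as a small driver. This is a simplified model rather than the controller's reconcile loop: actions are plain callables returning `True` on success, and the reading that a `pre` failure does not trigger `onFailure` follows from "from the apply onward" but is still an assumption.

```python
def run_stage(pre, apply, post, on_failure):
    """Run one stage and return (ready, trace of executed step names)."""
    trace = []

    def run(steps):
        for name, step in steps:
            trace.append(name)
            if not step():
                return False
        return True

    if not run(pre):
        return False, trace  # aborted with nothing applied
    if run([("apply", apply)]) and run(post):
        return True, trace   # Ready only when every post action succeeded
    for name, step in on_failure:
        trace.append(name)
        try:
            step()           # best-effort: the result is ignored
        except Exception:
            pass
    return False, trace


def ok():
    return True


def bad():
    return False


print(run_stage([("db-migrate", ok)], ok, [("smoke-test", bad)], [("page-oncall", ok)]))
```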

Each action has a `name`, optional `timeout` and `retries`, and **exactly one**
operation type (`patch`, `http`, `wait`, `job`, `delete`, or `apply`) — enforced
by the validating admission webhook. Actions within a hook run in list order.

```yaml
spec:
  stages:
    - name: application
      sourceRef:
        name: my-app
      actions:
        pre:
          - name: db-migrate
            timeout: 10m
            job:
              sourceRef:
                name: my-app-migrations    # render & await Jobs from this artifact
        post:
          - name: smoke-test
            retries: 3
            http:
              url: https://my-app.internal/healthz
              method: GET                  # `method` defaults to POST
              expectedStatus: [200]
        onFailure:
          - name: page-oncall
            http:
              url: https://alerts.internal/stageset-failed
              method: POST
```
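
The "exactly one operation type" rule is easy to check before you submit a StageSet. The sketch below models the rule on an action held as a dictionary; it is not the admission webhook's code, and the `validate_action` helper and its error strings are hypothetical.

```python
OPERATION_TYPES = ("patch", "http", "wait", "job", "delete", "apply")


def validate_action(action):
    """Return a list of problems; an empty list means the action looks valid."""
    errors = []
    if not action.get("name"):
        errors.append("name is required")
    present = [op for op in OPERATION_TYPES if op in action]
    if len(present) != 1:
        errors.append(f"exactly one operation type is required, found {len(present)}: {present}")
    return errors


print(validate_action({"name": "db-migrate", "timeout": "10m",
                       "job": {"sourceRef": {"name": "my-app-migrations"}}}))
print(validate_action({"name": "both", "http": {}, "wait": {}}))
```

Optional fields such as `timeout` and `retries` are ignored by the check, matching the description above.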

## The six action types

### `job`

Render and await Kubernetes Jobs from an artifact. The classic use is a database
migration that must complete before the app is applied.

```yaml
- name: db-migrate
  job:
    sourceRef:
      name: my-app-migrations
    path: ./jobs
```

### `http`

Call an HTTP endpoint and gate on the response — an approval webhook, a smoke
test, an external readiness probe. Hosts must be permitted by the controller's
`--allowed-action-hosts`; loopback and link-local are always denied. `method`
defaults to `POST`; `expectedStatus` defaults to any `2xx`. The body and headers
can be read from a `Secret` so tokens never sit in the spec:

```yaml
- name: smoke-test
  http:
    url: https://my-app.internal/healthz
    method: GET
    expectedStatus: [200]
    headersFrom:
      - name: gate-token       # Secret name
        key: authorization     # the key names the header; its value is the value
    bodyFrom:
      name: gate-payload       # Secret name
      key: body                # this key's value becomes the request body
```
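
The host and status rules for `http` actions can be modelled as two small predicates. This is a sketch under assumptions, not the controller's logic: it matches the literal host string against the globs without resolving DNS, it treats an empty allow list as "nothing allowed", and both helper names are hypothetical.

```python
import fnmatch
import ipaddress


def host_allowed(host, allowed_globs):
    """Deny loopback and link-local targets, then require a matching glob."""
    if host.lower() == "localhost":
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None  # not an IP literal, so it is a hostname
    if address is not None and (address.is_loopback or address.is_link_local):
        return False    # always denied, whatever the allow list says
    return any(fnmatch.fnmatchcase(host.lower(), glob.lower()) for glob in allowed_globs)


def status_ok(status, expected=None):
    """Without expectedStatus any 2xx passes; otherwise the status must be listed."""
    if expected is None:
        return 200 <= status < 300
    return status in expected


print(host_allowed("my-app.internal", ["*.internal"]))
print(host_allowed("169.254.169.254", ["*"]))
```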

### `wait`

Block for a fixed duration, or until a [CEL](https://github.com/google/cel-spec)
expression holds against a target object.

```yaml
- name: settle
  wait:
    duration: 30s
- name: until-available
  wait:
    target:
      apiVersion: apps/v1
      kind: Deployment
      name: web
    expr: "status.availableReplicas >= 3"
    timeout: 5m
```

### `patch`

Patch an existing object — flip a feature flag, scale something, annotate. `type`
is `merge` (default) for a strategic-merge patch, or `json6902` for a JSON Patch:

```yaml
- name: enable-traffic
  patch:
    target:
      apiVersion: v1
      kind: Service
      name: web
    type: merge               # default; or json6902
    patch: |
      { "spec": { "selector": { "release": "green" } } }
```

### `delete`

Remove an existing object; a missing object counts as success.

```yaml
- name: drop-old-job
  delete:
    target:
      apiVersion: batch/v1
      kind: Job
      name: legacy-migration
```

### `apply`

Apply transient, rollout-scoped manifests that are **not** inventory-tracked and
are never pruned — a maintenance page, a one-shot canary, a temporary config. With
`wait: true` the action blocks until the applied objects report Ready (kstatus),
bounded by the action `timeout`, so a following `patch` can repoint traffic only
once the resource is serving.

Because the inventory diff never prunes the applied objects, you have to remove
them yourself. To stand a resource up only for the duration of a rollout, pair an
`apply` in `pre` with a matching `delete` in `post`, and cover a mid-run failure
with an `onFailure` delete:

```yaml
actions:
  pre:
    - name: stand-up-maintenance-page
      apply:
        sourceRef:
          name: maintenance-page    # an ExternalArtifact holding a Pod + Service
        wait: true                  # block until it is serving
  post:
    - name: tear-down-maintenance-page
      delete:
        target:
          apiVersion: v1
          kind: Pod
          name: maintenance-page
  onFailure:
    - name: tear-down-maintenance-page-on-failure
      delete:
        target:
          apiVersion: v1
          kind: Pod
          name: maintenance-page
```

The action ledger gates each step per pinned revision, so a retry or controller
restart never re-applies or re-deletes the resource for the same snapshot.
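
The idea behind the ledger can be illustrated with a run-once guard keyed by revision, hook, and action name. Everything in this sketch is hypothetical, including the `ActionLedger` class, the key shape, and the revision string; it also keeps its state in memory, whereas surviving a controller restart requires persisted state.

```python
class ActionLedger:
    """Remember which actions already completed for a given pinned revision."""

    def __init__(self):
        self._done = set()

    def run_once(self, revision, hook, name, step):
        """Run step unless this (revision, hook, name) was already recorded."""
        key = (revision, hook, name)
        if key in self._done:
            return False       # already done for this snapshot: skip
        step()
        self._done.add(key)    # recorded only after the step returned
        return True


calls = []
ledger = ActionLedger()
ledger.run_once("sha256:abc", "pre", "stand-up-maintenance-page", lambda: calls.append("apply"))
ledger.run_once("sha256:abc", "pre", "stand-up-maintenance-page", lambda: calls.append("apply"))
print(calls)
```

A new revision produces a new key, so the same action runs again for the next snapshot.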

To run a `job` action only when the deployed version crosses a release boundary,
see [versioned migrations](/usage/versioned-migrations/).


---

# Conflict policies

Source: https://stageset.projects.metio.wtf/usage/conflict-policies/


Conflict policies decide what happens when an apply hits an immutable-field
conflict — a changed `clusterIP`, a `Job` pod template, a `StorageClass` field
that can't be updated in place. By default the controller fails the stage and
reports it, so nothing destructive happens by surprise. A policy opts specific
resources into automatic resolution.

## The three actions

- `Fail` — stop and report (the default; safest).
- `Recreate` — delete and re-create the object to get past an immutable-field
  change.
- `KeepExisting` — leave the live object as-is and move on.

## A default for the whole stage

```yaml
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app
      conflictPolicy:
        default: Fail            # explicit; the safe default
```

The `force: true` shorthand on a stage is equivalent to
`conflictPolicy.default: Recreate`.
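
As a sketch, the shorthand sits directly on the stage. The stage and source names are reused from the example above.

```yaml
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app
      force: true              # shorthand for conflictPolicy.default: Recreate
```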

## Per-resource rules

Rules recreate exactly the resources that need it while everything else stays
`Fail`. A rule's `target` is a partial selector — any field you omit matches
everything. Rules are evaluated in list order; the **first** rule whose target
matches wins, and an object matching no rule falls back to `default`.

```yaml
      conflictPolicy:
        default: Fail
        rules:
          # a Job's pod template is immutable — recreate it on change
          - target:
              apiVersion: batch/v1
              kind: Job
            action: Recreate
          # never fight an HPA over replica counts
          - target:
              kind: Deployment
              name: web
            action: KeepExisting
```
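
Because the first matching rule wins, a narrow rule has to come before a broad one that would also match. The sketch below assumes a `Job` named `schema-migration` (a made-up name) that must be left alone while every other `Job` is recreated on conflict.

```yaml
      conflictPolicy:
        default: Fail
        rules:
          # narrow rule first: this one Job is never touched
          - target:
              kind: Job
              name: schema-migration   # hypothetical Job name
            action: KeepExisting
          # broad rule second: all remaining Jobs are recreated
          - target:
              kind: Job
            action: Recreate
```

With the two rules swapped, the broad `Job` rule would match first and `schema-migration` would be recreated as well.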

## Recreating storage

Recreating a `PersistentVolumeClaim` or `PersistentVolume` destroys data, so a
`Recreate` **rule** targeting one is refused unless you explicitly accept the loss:

```yaml
        rules:
          - target:
              kind: PersistentVolumeClaim
              name: scratch
            action: Recreate
            allowDataLoss: true     # required for PVC/PV Recreate, refused otherwise
```

Without `allowDataLoss: true`, a `Recreate` rule targeting a PVC/PV is rejected —
a guardrail against accidentally wiping a volume.


---

# Multi-cluster and tenancy

Source: https://stageset.projects.metio.wtf/usage/multi-cluster/


There are two ways to run the controller, and they map onto two different trust
models. Pick the one that matches your cluster:

- **Multi-tenant** — the controller holds no write access of its own and applies
  every `StageSet` impersonating that `StageSet`'s `serviceAccountName`. Each
  tenant's RBAC bounds what its releases can touch. This is the chart default.
- **Single-tenant** — the cluster has one operator, so per-tenant isolation buys
  nothing. Run the controller under its own identity bound to `cluster-admin` and
  skip impersonation entirely — the model Flux's `helm-controller` uses in its
  default install.

The two sections below set each one up. The optional
[watch scoping](#scoping-the-controller-to-a-namespace-set) narrows *which*
namespaces a multi-tenant controller sees.

## Impersonation (multi-tenant)

In this mode the controller never applies your manifests as itself. Set
`serviceAccountName` and every operation for that `StageSet` (build, apply, prune,
actions) is performed while impersonating that ServiceAccount. The `StageSet` can
do exactly what the SA's RBAC permits, and nothing more.

```yaml
spec:
  serviceAccountName: payments-deployer    # all writes impersonate this SA
  stages:
    - name: app
      sourceRef:
        name: payments-app
```

Grant the SA only the rights that release needs:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-deployer
  namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - kind: ServiceAccount
    name: payments-deployer
    namespace: payments
```

This is the multi-tenancy model: isolation comes from each `StageSet` being bounded
by its tenant SA, not from the controller's own grant — by default the chart gives
the controller `impersonate` and read access, no blanket write. A `StageSet` with no
`serviceAccountName`, or one bound to a too-narrow SA, fails closed rather than
escalating.
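
The `RoleBinding` above refers to a ServiceAccount that has to exist. A minimal manifest for it could look like this, assuming the SA lives in the same namespace as the `StageSet`:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-deployer      # matches spec.serviceAccountName and the RoleBinding subject
  namespace: payments          # assumed: the StageSet's own namespace
```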

## Single-tenant cluster-admin

On a cluster with a single operator, per-`StageSet` impersonation is friction with
no payoff — there is no other tenant to isolate from. Run the controller the way
Flux's `helm-controller` runs by default: under its own ServiceAccount, bound to
the built-in `cluster-admin` ClusterRole. `StageSet`s then omit `serviceAccountName`
and apply as the controller, which can write any kind cluster-wide.

Turn it on with one Helm value:

```yaml
rbac:
  clusterAdmin: true     # bind the controller SA to cluster-admin
```

```bash
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  -n stageset-system --create-namespace \
  --set rbac.clusterAdmin=true
```

`StageSet`s then need nothing tenancy-related — they apply directly:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: platform
  namespace: stageset-system
spec:
  stages:
    - name: app
      sourceRef:
        name: platform-app    # applied by the controller's cluster-admin identity
```

When `serviceAccountName` is unset and no `kubeConfig` is given, the controller
applies with its own client — so the `cluster-admin` binding is what lets those
`StageSet`s write. The trade-off: every `StageSet` on the cluster has full write
access, so this is for single-tenant clusters only. Leave `rbac.clusterAdmin` at its
default `false` and use [impersonation](#impersonation-multi-tenant) whenever more
than one team shares the cluster. The two modes can be combined: a cluster-admin
controller still honors `serviceAccountName` on any `StageSet` that sets it,
dropping to that SA's rights for that release.

## Scoping the controller to a namespace set

By default the controller watches every namespace. To run one controller per
tenant-group instead — disjoint deployments that each see only their own
namespaces — set `controller.watchNamespaces`:

```yaml
controller:
  watchNamespaces:
    - team-a
    - team-b
```

This does two things together:

- **Cache scoping.** The manager's informers only observe `StageSet`s and sources
  in the listed namespaces. Resources elsewhere never enter the cache, so the
  controller cannot act on them even if RBAC would allow it.
- **RBAC pivot.** The chart stops binding the tenant ClusterRole cluster-wide and
  instead renders one `RoleBinding` per listed namespace — defense in depth, so the
  apiserver also refuses out-of-scope calls. (The cluster-scoped webhook-caBundle
  grant stays a `ClusterRoleBinding`, since a `ValidatingWebhookConfiguration` is
  not namespaced.)

Install the chart as several Helm releases with disjoint `watchNamespaces` lists
to shard the cluster across independent controller instances. Combine it with
impersonation for the tightest setup: each instance sees only its namespaces, and
each `StageSet` is bounded by its tenant SA.
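
Sharding across two controller instances could look like the following pair of values files. The file names and the namespaces `team-c` and `team-d` are illustrative; only `controller.watchNamespaces` comes from the chart values shown above.

```yaml
# values-shard-a.yaml (hypothetical file name), used by the first Helm release
controller:
  watchNamespaces:
    - team-a
    - team-b
---
# values-shard-b.yaml (hypothetical file name), used by the second Helm release
controller:
  watchNamespaces:
    - team-c
    - team-d
```

The lists must not overlap, otherwise two instances would reconcile the same `StageSet`.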

## Remote clusters

Point a `StageSet` at another cluster with `kubeConfig`, referencing a Secret that
holds a kubeconfig. Combined with `serviceAccountName`, the controller applies to
the remote cluster as the impersonated identity there.

```yaml
spec:
  serviceAccountName: payments-deployer
  kubeConfig:
    secretRef:
      name: prod-eu-kubeconfig
      # key defaults to "value" (the Flux convention); set it to override
  stages:
    - name: app
      sourceRef:
        name: payments-app
```

The Secret is read with the controller's own identity — connecting to the target
cluster is the controller's job — and the kubeconfig payload defaults to the
`value` key. A self-contained kubeconfig is required; `configMapRef`-style
cloud-provider auth is not supported.
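
A Secret matching that reference might look like the sketch below. The server URL, user name, and credentials are placeholders, and placing the Secret in the `StageSet`'s namespace is an assumption based on the Flux convention rather than something stated here.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prod-eu-kubeconfig
  namespace: payments                  # assumed: same namespace as the StageSet
stringData:
  value: |                             # "value" is the default key the controller reads
    apiVersion: v1
    kind: Config
    clusters:
      - name: prod-eu
        cluster:
          server: https://prod-eu.example.com:6443        # placeholder endpoint
          certificate-authority-data: "<base64 CA bundle>"
    users:
      - name: deployer
        user:
          token: "<bearer token>"                         # placeholder credential
    contexts:
      - name: prod-eu
        context:
          cluster: prod-eu
          user: deployer
    current-context: prod-eu
```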

Cross-namespace `sourceRef` and `dependsOn` references can be disabled
cluster-wide with the controller's `--no-cross-namespace-refs` flag when you want
hard namespace isolation.


---

# Producer-aware sources

Source: https://stageset.projects.metio.wtf/usage/producer-aware-sources/


[Stages and sources](/usage/stages-and-sources/#source-kinds) covers the two
direct routes — an `ExternalArtifact` (the default `sourceRef.kind`) or a Flux
`GitRepository`/`OCIRepository`/`Bucket`. The third option names the thing that
*produces* an artifact and lets the controller find it. This is useful when an
operator publishes an `ExternalArtifact` from a custom resource (for example
[JaaS](https://jaas.projects.metio.wtf/) rendering Jsonnet).

## Referencing a producer

Set `kind` (and `apiVersion`) to a producer resource, and the controller resolves
it to the `ExternalArtifact` that producer publishes — the one whose
`spec.sourceRef` back-references the producer (matched on group, kind, and name).
For example, a [JaaS](https://jaas.projects.metio.wtf/) `JsonnetSnippet`
renders Jsonnet and publishes an `ExternalArtifact`; reference the snippet and the
controller follows the link:

```yaml
spec:
  stages:
    - name: dashboards
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: grafana-dashboards
```

The controller also watches the common Flux source kinds (`GitRepository`,
`OCIRepository`, `Bucket`) so a stage re-reconciles when an upstream source
changes.

A producer can itself consume another producer first: a JaaS `JsonnetSnippet` can
render from the artifact another snippet publishes. That chaining happens on the
producer side — see
[chaining snippets](https://jaas.projects.metio.wtf/usage/snippet-sources/#chaining-snippets).
A stage references only the final producer and reads the `ExternalArtifact` it
publishes.

## Related projects

JOI, JaaS, and `StageSet` compose end to end:

- **[JOI](https://github.com/metio/jsonnet-oci-images)** publishes Jsonnet
  libraries as single-layer OCI images (usable both as image-volume mounts and as
  Flux `OCIRepository` sources).
- **[JaaS](https://jaas.projects.metio.wtf/)** evaluates Jsonnet — optionally
  importing those JOI libraries — and publishes the rendered JSON as an
  `ExternalArtifact`.
- **`StageSet`** references the `JsonnetSnippet` (or its artifact) and rolls the
  result out in ordered, gated stages.

Each project is independently useful: a stage can read straight from a
`GitRepository`, `OCIRepository`, or `Bucket`, or from any `ExternalArtifact`
regardless of what produced it.


---

# Ready checks

Source: https://stageset.projects.metio.wtf/usage/ready-checks/


Ready checks decide when a stage is healthy enough to let the next stage start.
They are purely observational — the controller waits and reports, but takes no
action (active steps are [actions](/usage/actions/)).

By default, with no `readyChecks` block, the controller waits for **every** object
the stage applied to report ready via
[kstatus](https://github.com/kubernetes-sigs/cli-utils/tree/master/pkg/kstatus).
`readyChecks` lets you narrow that to specific objects (`checks`), add custom
health for resources kstatus doesn't understand (`exprs`, [CEL](https://github.com/google/cel-spec)),
bound the wait (`timeout`), or skip it entirely (`disableWait`). `checks` and
`exprs` may be set together.

## Explicit objects

Wait for named objects only — useful when a stage applies many objects but only a
few gate the next stage:

```yaml
spec:
  stages:
    - name: infrastructure
      sourceRef:
        name: platform
      readyChecks:
        timeout: 5m
        checks:
          - apiVersion: apiextensions.k8s.io/v1
            kind: CustomResourceDefinition
            name: ledgers.payments.example
          - apiVersion: apps/v1
            kind: Deployment
            name: ledger-operator
            namespace: platform-system
```

## Custom health with CEL

For custom resources kstatus doesn't understand, describe readiness with CEL
expressions. The shape matches `kustomize-controller`'s `healthCheckExprs`, so
expressions are portable.

```yaml
      readyChecks:
        exprs:
          - apiVersion: db.example/v1
            kind: Database
            current: "status.phase == 'Running'"
            inProgress: "status.phase in ['Pending', 'Provisioning']"
            failed: "status.phase == 'Failed'"
```
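
Since `checks` and `exprs` may be set together, one stage can gate on a named object and on a custom resource at once. The sketch below combines the two shapes shown above; the stage, artifact, and operator names are made up.

```yaml
spec:
  stages:
    - name: database                    # hypothetical stage
      sourceRef:
        name: platform-db               # hypothetical artifact
      readyChecks:
        timeout: 10m                    # bound the whole wait
        checks:
          - apiVersion: apps/v1
            kind: Deployment
            name: db-operator           # hypothetical operator Deployment
            namespace: platform-system
        exprs:
          - apiVersion: db.example/v1
            kind: Database
            current: "status.phase == 'Running'"
            inProgress: "status.phase in ['Pending', 'Provisioning']"
            failed: "status.phase == 'Failed'"
```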

## Opting out

To apply a stage without waiting for readiness (fire-and-forget), disable the
wait:

```yaml
      readyChecks:
        disableWait: true
```


---

# Rollback

Source: https://stageset.projects.metio.wtf/usage/rollback/


When a run fails, the controller can restore the last successfully-applied artifact
revisions instead of leaving you on a broken release. Rollback is opt-in and needs
somewhere to keep prior revisions.

## Enabling it

```yaml
spec:
  rollbackOnFailure: true
  stages:
    - name: app
      sourceRef:
        name: my-app
```

On a failed run the controller restores each stage's last-good artifact revision,
best-effort, and emits a `RolledBack` event. The coordinates it restores from are
recorded in `status.lastAppliedSnapshot`.

## The rollback store

Rollback needs the prior revision to still be fetchable, so the controller keeps a
copy in a **rollback store**. Configure one on the controller (cluster-wide), via
either a shared filesystem or S3:

```text
# filesystem (an RWX PersistentVolumeClaim)
--rollback-store-path=/var/lib/stageset/rollback

# or S3-compatible object storage
--rollback-store-s3-endpoint=s3.example.com
--rollback-store-s3-bucket=stageset-rollback
```

The two are mutually exclusive. With no store configured, rollback can only use a
prior revision the producer itself still retains; a dedicated store makes rollback
reliable across producer pruning.

### Encryption at rest

The store keeps each stage's rendered output, which includes any `Secret`'s data —
including [SOPS](https://github.com/getsops/sops)-decrypted values (see
[secrets encryption](/usage/encryption/)). Treat it as sensitive and keep it
encrypted at rest:

- **S3** encrypts by default. `--rollback-store-s3-sse` (chart:
  `rollbackStore.s3.sse`) is `s3` (SSE-S3) out of the box; set `kms` with
  `rollbackStore.s3.sseKmsKeyId` for SSE-KMS, or `none` only for a backend that
  cannot honor an SSE header. A rejected SSE write is non-fatal — it warns via a
  `RollbackStoreFailed` event and skips the store write; the rollout still
  succeeds.
- **Filesystem** can't encrypt itself — back the PVC with an **encrypted volume**
  (an encrypted `StorageClass`, LUKS, or cloud-disk encryption). The controller
  logs a reminder at startup when the file store is enabled.
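
As a sketch, switching the S3 store to SSE-KMS through the chart uses the two value paths named above. The key identifier is a placeholder to replace with your own.

```yaml
rollbackStore:
  s3:
    sse: kms                          # default is "s3" (SSE-S3)
    sseKmsKeyId: "<kms-key-id>"       # placeholder; the KMS key used for SSE-KMS
```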

If a restore can't proceed because the previous revision is gone, the run fails
with the `PreviousRevisionUnavailable` reason (see its
[runbook](/runbooks/previousrevisionunavailable/)), and a store problem surfaces as
a `RollbackStoreFailed` event.


---

# Secrets encryption (SOPS)

Source: https://stageset.projects.metio.wtf/usage/encryption/


A stage's source can carry [SOPS](https://github.com/getsops/sops)-encrypted
files — typically a `Secret` whose values are encrypted — and the controller
decrypts them in memory, before building and applying the manifests. This mirrors
Flux's `kustomize-controller` decryption contract, so an existing SOPS-encrypted
repository works unchanged.

Set `spec.decryption` and point it at a Secret holding the keys:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: payments
  namespace: payments
spec:
  serviceAccountName: payments-deployer
  decryption:
    provider: sops          # the only provider
    secretRef:
      name: sops-age        # a Secret in this namespace holding the age key
  stages:
    - name: app
      sourceRef:
        kind: GitRepository
        name: payments-config   # contains the SOPS-encrypted secret.enc.yaml
```

## Walkthrough — age

[age](https://age-encryption.org/) is the simplest key type and needs no external
service. Take a `Secret` from plaintext to a GitOps-safe rollout in four steps.

**1. Generate an age key.** The file holds the private key; the printed `age1…`
line is the public recipient to encrypt to.

```bash
age-keygen -o age.agekey
# public key: age1qz…
```

**2. Encrypt a Secret.** Encrypt only its values, so the file stays a valid
Kubernetes object, then commit `secret.enc.yaml` (never the plaintext):

```yaml
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: payments-db
  namespace: payments
stringData:
  password: s3cr3t-do-not-commit-plaintext
```

```bash
sops --encrypt --age age1qz… \
  --encrypted-regex '^(data|stringData)$' \
  secret.yaml > secret.enc.yaml
```

**3. Put the private key in the cluster** under a data entry whose name ends in
`.agekey`. Store `age.agekey` itself somewhere safe, because it is the only thing
that can decrypt the Secret.

```bash
kubectl create secret generic sops-age \
  --namespace payments \
  --from-file=keys.agekey=age.agekey
```

**4. Decrypt on rollout.** Point a `StageSet` at the source holding
`secret.enc.yaml` and set `spec.decryption` (as in the example above). On reconcile
the controller fetches the source, decrypts every SOPS file in memory, builds, and
applies — so the cluster holds the plaintext `payments-db` Secret while Git only
ever held ciphertext. Grant the deployer ServiceAccount read access to the key
Secret (see [tenancy](#how-keys-are-read--tenancy) below).
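
The flags from step 2 can optionally live in a `.sops.yaml` file in the repository, so nobody has to retype them. This is not part of the walkthrough itself; it is a sketch using SOPS's own `creation_rules` configuration, with the recipient left elided exactly as above.

```yaml
# .sops.yaml at the repository root
creation_rules:
  - path_regex: 'secret\.yaml$'                # which input files this rule applies to
    encrypted_regex: '^(data|stringData)$'     # encrypt only the Secret's values
    age: age1qz…                               # the public recipient from step 1
```

With the rule in place, `sops --encrypt secret.yaml > secret.enc.yaml` picks up the recipient and the regex from the file.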

## Pairing with JaaS-rendered manifests

A realistic app renders its config from Jsonnet with
[JaaS](https://jaas.projects.metio.wtf/) and keeps only its Secret encrypted. The
two compose cleanly because each owns one concern:

- **JaaS renders the non-secret manifests.** It evaluates Jsonnet server-side and
  cannot hold secret values: SOPS ciphertext carries a MAC over the whole encrypted
  document, so it can't be authored in Jsonnet — and routing plaintext secrets
  through a render service is what you are avoiding.
- **The Secret stays SOPS-encrypted in Git**, as in the walkthrough.
- **The controller decrypts and orders both** under one `spec.decryption`:

```yaml
spec:
  serviceAccountName: payments-deployer
  decryption:
    provider: sops
    secretRef:
      name: sops-age
  stages:
    - name: secrets                 # decrypt + apply the SOPS Secret first
      sourceRef:
        kind: GitRepository
        name: payments-secrets
    - name: app                     # then the JaaS-rendered app that mounts it
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: payments-app
```

The `secrets` stage runs first; only once the `Secret` is applied does the `app`
stage roll out the rendered manifests that mount it. The encrypted Secret and the
rendered config live in separate sources, so the Jsonnet author never touches secret
material.

## The fields

- **`provider`** — the decryption backend. Only `sops` is supported.
- **`secretRef.name`** — a Secret in the `StageSet`'s namespace holding the keys,
  using the SOPS conventions: age private keys under data entries ending in
  `.agekey`, armored PGP private keys under `.asc`. Optional — omit it for a
  [cloud-KMS-only](#cloud-kms) setup.

## How keys are read — tenancy

The key Secret is read in the `StageSet`'s namespace **under its
`serviceAccountName`**, exactly like the manifests it applies. A tenant can only
decrypt with key material its own ServiceAccount is allowed to read, so a key in one
namespace is never reachable from another. Grant the deployer SA `get` on the key
Secret:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-deployer-sops
  namespace: payments
rules:
  - apiGroups: [""]
    resources: [secrets]
    resourceNames: [sops-age]
    verbs: [get]
```
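
The `Role` takes effect only once it is bound to the deployer SA. A matching `RoleBinding` could look like this (the binding name is an arbitrary choice):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-deployer-sops       # arbitrary name for the binding
  namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: payments-deployer-sops       # the Role defined above
subjects:
  - kind: ServiceAccount
    name: payments-deployer
    namespace: payments
```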

In a [single-tenant cluster-admin](/usage/multi-cluster/#single-tenant-cluster-admin)
install (no `serviceAccountName`), the controller reads the key Secret under its
own identity instead.

## Decryption and the rollback store

Decrypted bytes exist only in memory on the apply path. The one place rendered
output is persisted is the optional [rollback store](/usage/rollback/), which is
meant to be **encrypted at rest** (S3 SSE by default; an encrypted volume for the
file store), so that a decrypted `Secret` does not land in plaintext on disk. That
holds only while SSE stays enabled and the file store's PVC is backed by an
encrypted volume. See [encryption at rest](/usage/rollback/#encryption-at-rest).

A rollback re-fetches the previous source and **runs decryption again** rather than
restoring plaintext, so the key Secret must still exist when a rollback fires. If
the key was rotated or deleted in the meantime, the rollback **fails closed** with
`PreviousRevisionUnavailable` instead of applying a stale or unreadable Secret — an
encrypted store cannot avoid this, and it is the safe failure direction.

## Cloud KMS

SOPS files encrypted with a cloud KMS key (AWS KMS, GCP KMS, Azure Key Vault, or
HashiCorp Vault) decrypt through the **controller's ambient credentials** — e.g. an
IRSA role on EKS, wired via `serviceAccount.annotations`. No in-cluster key Secret
is needed, so `secretRef` may be omitted for a KMS-only `StageSet`:

```yaml
spec:
  decryption:
    provider: sops          # secretRef omitted; KMS uses the controller's identity
```

One consequence to weigh in a multi-tenant cluster: unlike age (read under the
tenant SA), **cloud KMS uses the controller's identity**, so any `StageSet` can
decrypt a file encrypted with a KMS key the controller's role can access. This
matches Flux's `kustomize-controller`. Scope the controller's KMS grant
accordingly, or use age keys for hard per-tenant isolation.
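
On EKS, wiring an IRSA role through the chart might look like the values fragment below. The account ID and role name are placeholders; `eks.amazonaws.com/role-arn` is the annotation EKS uses to associate an IAM role with a ServiceAccount.

```yaml
serviceAccount:
  annotations:
    # placeholder ARN: an IAM role allowed to call kms:Decrypt on the SOPS key
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/stageset-controller-sops
```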

## What's supported

- **age** keys via `secretRef` — read under the tenant SA. The resource-level
  pattern (`--encrypted-regex '^(data|stringData)$'`) is the tested path.
- **PGP** keys via `secretRef` (`.asc` entries) — read under the tenant SA, pure
  Go, no `gpg` binary or keyring needed. See [PGP keys](#pgp-keys).
- **Cloud KMS** (AWS/GCP/Azure/Vault) via the controller's ambient credentials.
- **Encrypted files feeding a `secretGenerator`** — an encrypted `.env` (or other
  file) referenced by a kustomize `secretGenerator` is decrypted before the build,
  so the generated `Secret` carries the plaintext.
- A file with no SOPS metadata passes through untouched, so encrypted and plain
  manifests can sit side by side in one source.

## PGP keys

PGP works **tenant-scoped**, like age: put one or more armored private keys in the
`secretRef` Secret under data entries suffixed `.asc`. The data key is decrypted in
pure Go (`ProtonMail/go-crypto`) directly from those keys — **no `gpg` binary, no
GnuPG keyring, and no `GNUPGHOME`** — and the keys are read under the `StageSet`'s
`serviceAccountName`, so a tenant can only use material its ServiceAccount can read.

```bash
# export the armored private key and load it into the key Secret
gpg --export-secret-keys --armor 0xYOURFINGERPRINT > key.asc

kubectl create secret generic sops-keys \
  --namespace payments \
  --from-file=pgp.asc=key.asc
```

```yaml
spec:
  decryption:
    provider: sops
    secretRef:
      name: sops-keys      # holds the *.asc private key(s)
```

One Secret can carry both age (`*.agekey`) and PGP (`*.asc`) keys; the right one is
used per file. For a fresh setup, age is simpler and the recommended default, but an
existing PGP-encrypted repository needs no migration.


---

# Stages and sources

Source: https://stageset.projects.metio.wtf/usage/stages-and-sources/


A `StageSet` is an ordered list of stages. Each stage resolves a
[Flux](https://fluxcd.io/) source — a `GitRepository`, `OCIRepository`, `Bucket`,
or an `ExternalArtifact` (the default) — applies its manifests, waits for them to
become healthy, and only then lets the next stage start.

## One stage

The minimum is one stage pointing at one artifact in the same namespace:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: my-app
  namespace: default
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app          # an ExternalArtifact
```

`sourceRef.kind` defaults to `ExternalArtifact`, so the common case is a single
line. The controller fetches the artifact, applies every manifest in it, and marks
the stage `Ready` once the applied objects report healthy.

## Source kinds

A `sourceRef` resolves to a Flux artifact in one of three ways. Point it at
whichever you already have:

```yaml
# 1. an ExternalArtifact (the default — kind omitted)
sourceRef:
  name: my-app

# 2. a classic Flux source, consumed directly
sourceRef:
  kind: GitRepository        # or OCIRepository, or Bucket
  name: my-app-manifests

# 3. a producer that publishes an ExternalArtifact (resolved via its back-pointer)
sourceRef:
  apiVersion: jaas.metio.wtf/v1
  kind: JsonnetSnippet
  name: my-app
```

`GitRepository`, `OCIRepository`, and `Bucket` carry the same `status.artifact`
contract as `ExternalArtifact`, so the controller reads them directly — no producer
in between. A stage can apply manifests straight from a Git repo or an OCI artifact,
like Flux's own `kustomize-controller`. For the producer case (for example
rendering Jsonnet with [JaaS](https://jaas.projects.metio.wtf/)), see
[producer-aware sources](/usage/producer-aware-sources/).

## Ordered stages

Add more stages and they run top to bottom — each one waits for the previous to be
`Ready`:

```yaml
spec:
  stages:
    - name: crds          # 1 ── install the CRDs first
      sourceRef:
        name: platform-crds
    - name: operator      # 2 ── then the operator that needs them
      sourceRef:
        name: platform-operator
    - name: workloads     # 3 ── then the workloads it manages
      sourceRef:
        name: team-workloads
```

This is the core of a `StageSet`: `operator` is never applied until `crds` is
healthy, so the operator never crash-loops waiting for a CRD that isn't there yet.

## Shaping a stage's manifests

A stage can build from a sub-path of the artifact, customize with patches, and
substitute variables — the [kustomize](https://kubectl.docs.kubernetes.io/)-style
surface:

```yaml
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app
      path: ./overlays/production      # build a sub-path of the artifact
      prune: true                      # GC objects that leave this stage (default)
      patches:
        - patch: |
            - op: replace
              path: /spec/replicas
              value: 6
          target:
            kind: Deployment
            name: web
      postBuild:
        substitute:
          cluster_name: prod-eu
        substituteFrom:
          - kind: ConfigMap
            name: cluster-vars
          - kind: Secret
            name: cluster-secrets
            optional: true
```

- **`path`** builds from a directory inside the artifact (default `./`).
- **`prune`** (default `true`) garbage-collects objects that fall out of the stage
  between reconciles, tracked precisely via the stage's
  [`StageInventory`](/api/stageinventory/).
- **`patches`** are strategic-merge or JSON6902 patches applied after the build.
- **`postBuild`** substitutes `${var}` references from inline values, ConfigMaps,
  and Secrets at delivery time — see [parameterizing a rollout](/tutorials/parameters/)
  for the full render-time-vs-delivery-time treatment.
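
To make the substitution concrete, the sketch below shows a `cluster-vars` ConfigMap and a manifest inside the artifact that consumes it. It assumes that each ConfigMap data key becomes a variable name (the Flux convention) and that the ConfigMap lives in the `StageSet`'s namespace; the `region` key and the `web-config` object are made up.

```yaml
# the ConfigMap referenced by substituteFrom
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-vars
data:
  region: eu-central-1            # hypothetical variable
---
# a manifest in the artifact, before substitution
apiVersion: v1
kind: ConfigMap
metadata:
  name: web-config                # hypothetical object
data:
  CLUSTER_NAME: ${cluster_name}   # from the inline substitute map
  REGION: ${region}               # from cluster-vars
```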

From here, layer on [actions](/usage/actions/) to gate the stage, or
[ready checks](/usage/ready-checks/) to define what "healthy" means.


---

# Update windows

Source: https://stageset.projects.metio.wtf/usage/update-windows/


Update windows gate *when* new artifact revisions roll out, without pausing
reconciliation. Drift correction keeps running; only the rollout of a *new*
revision is held until a window allows it.

## Deny a recurring window

Freeze rollouts during business hours:

```yaml
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app
  updateWindows:
    - type: Deny
      schedule: "0 9 * * MON-FRI"   # 5-field cron: start of the window
      duration: 8h
      timeZone: Europe/Berlin
```

A new revision that arrives inside the window is held; `status.pendingUpdate`
records what is waiting and `nextWindowOpens` when it will ship. The controller
emits an `UpdateDeferred` event and increments `stageset_update_deferred_total`.

## Allow-list windows

If any `Allow` window exists, rollouts happen **only** inside an active Allow with
no active Deny — `Deny` always wins. This expresses "only deploy on Tuesday and
Thursday afternoons":

```yaml
  updateWindows:
    - type: Allow
      schedule: "0 14 * * TUE,THU"
      duration: 3h
      timeZone: America/New_York
```

## A one-off freeze

Absolute windows use `from`/`to` instead of a schedule — for a planned event
freeze:

```yaml
  updateWindows:
    - type: Deny
      from: 2026-12-24T00:00:00Z
      to:   2026-12-27T00:00:00Z
```
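
Recurring and absolute windows can sit in the same list. The sketch below combines the two examples above to show the "`Deny` always wins" rule in practice:

```yaml
  updateWindows:
    - type: Allow                      # normally: Tuesday and Thursday afternoons only
      schedule: "0 14 * * TUE,THU"
      duration: 3h
      timeZone: America/New_York
    - type: Deny                       # holiday freeze overrides any Allow inside it
      from: 2026-12-24T00:00:00Z
      to: 2026-12-27T00:00:00Z
```

Thursday 24 December 2026 has an `Allow` slot, but it falls inside the `Deny` range, so nothing ships until the next `Allow` slot after the freeze ends.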

## What a closed window blocks

`windowScope` controls what a closed window holds back:

- **`Updates`** (default) — hold only the rollout of a *new* artifact revision.
  Drift correction keeps re-applying the pinned state, so the live cluster stays
  on its last-approved revision but doesn't fall out of sync.
- **`All`** — a hard freeze: also pause drift correction, so the controller
  applies nothing at all while the window is closed.

```yaml
  windowScope: Updates   # default: hold new revisions, keep correcting drift
  # windowScope: All     # hard freeze: also pause drift correction
```
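
Put together, a hard freeze for a planned event might look like the following. Placing `windowScope` directly under `spec`, next to `updateWindows`, is inferred from the indentation of the fragments above rather than stated explicitly.

```yaml
spec:
  windowScope: All                 # assumed to sit under spec; pauses drift correction too
  updateWindows:
    - type: Deny
      from: 2026-12-24T00:00:00Z
      to: 2026-12-27T00:00:00Z
  stages:
    - name: app
      sourceRef:
        name: my-app
```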

## Shipping anyway

To push a held rollout through immediately, override the window with
[`stagesetctl`](/cli/):

```shell
stagesetctl reconcile my-app --update-now
```

This stamps the `stages.metio.wtf/update-now` annotation; the honored value is
recorded in `status.lastHandledUpdateOverride`.


---

# Versioned migrations

Source: https://stageset.projects.metio.wtf/usage/versioned-migrations/


Some changes only need to happen once, when you cross a release boundary — a
one-time data backfill on the way to 2.0, a schema conversion between 1.x and 2.x.
Versioned migrations run a ladder of [actions](/usage/actions/) exactly when the
deployed version steps over the boundary, and never again.

Versioning is off until you set `spec.version`.

## Declaring the version

The controller needs to know *what version is currently being deployed*. There are
three ways to declare it; pick by **where the version lives**.

| Source | The version lives… | Best for |
|---|---|---|
| [`version.value`](#inline--versionvalue) | on the `StageSet` | environment-pinned versions, quick starts |
| [`version.fromObject`](#from-a-rendered-object--versionfromobject) | inside the manifests | **any source, including JaaS** — the recommended default |
| [`version.fromArtifact`](#from-a-file-in-the-artifact--versionfromartifact) | a file in the artifact | Git/OCI/Bucket sources that can ship a `VERSION` file |

Whichever you choose, the resolved value is trimmed and parsed as semver (a leading
`v` is accepted). A missing stage/object/file, an empty value, or an unparseable
one fails terminally with the `InvalidVersion` reason (see its
[runbook](/runbooks/invalidversion/)) — a half-versioned system is worse than an
unversioned one.

### Inline — `version.value`

The `StageSet` author pins the version directly. Use this when the version is a
property of the environment rather than of the content, or to get started quickly:

```yaml
spec:
  version:
    value: "2.1.0"      # bump this when you cut a release
```

The trade-off: the version is declared here, not carried by the content, so you
bump it by editing the `StageSet`.

### From a rendered object — `version.fromObject`

The recommended way to let the version travel with the content.
[Kubernetes](https://kubernetes.io/docs/) has a standard place for an
application's version: the `app.kubernetes.io/version` label. Well-formed manifests
set it, so the version is already inside the manifests — `fromObject` reads it back.
This works for every source kind, including a single-document renderer like
[JaaS](https://jaas.projects.metio.wtf/) that has no room for a separate file.

```yaml
spec:
  version:
    fromObject:
      stage: app            # which stage's rendered manifests carry it
      kind: Deployment      # the object to read
      name: web
      # fieldPath omitted → reads metadata.labels['app.kubernetes.io/version']
  stages:
    - name: app
      sourceRef:
        name: my-app
```

The controller builds the `app` stage's manifests (the same render it applies),
finds the `Deployment/web` object, and reads its `app.kubernetes.io/version` label.
Because the label is part of the manifests, the version changes in lockstep with
the content — no second file to keep in sync.

**Reading a different field.** Set `fieldPath` to a kubectl-style JSONPath that
resolves to the bare version string. (It must be the version *only*; a JSONPath
can't split an `image: web:2.1.0` value, so prefer the label.) `apiVersion` is
optional and narrows the match when a `Kind`+`Name` pair would be ambiguous:

```yaml
spec:
  version:
    fromObject:
      stage: app
      apiVersion: v1
      kind: ConfigMap
      name: app-meta
      fieldPath: "{.data.version}"   # must resolve to a bare semver, e.g. 2.1.0
```

This is the path the [Jsonnet-to-rollout tutorial](/tutorials/jsonnet-to-rollout/)
uses: the snippet renders the version into the manifest's version label, and the
StageSet reads it straight back.

### From a file in the artifact — `version.fromArtifact`

The version travels with the content as a **dedicated file** containing a single
semver. This fits **Git/OCI/Bucket** sources, where you can ship an extra file
beside the manifests. (It does *not* fit JaaS `rendered` output, which is a single
`rendered.json`; use `fromObject` there.)

**Who writes it, and where:** the artifact's producer. For a Git source, commit a
`VERSION` file in the repo; for an OCI/Bucket artifact, include it in the pushed
tree. The file lives at `path` inside the named stage's artifact, relative to the
artifact root:

```text
# VERSION — committed alongside the manifests it versions
2.1.0
```

```yaml
spec:
  version:
    fromArtifact:
      stage: app          # which stage's artifact carries the file
      path: VERSION       # the file's path inside that artifact (cleaned; no leading ./)
  stages:
    - name: app
      sourceRef:
        kind: GitRepository
        name: my-app
```

The controller fetches the `app` stage's artifact and reads the file at `path`.

## Declaring migrations

Each migration names the boundary it crosses (`to`, optionally `from`), the stage
it anchors before, and the actions to run:

```yaml
spec:
  version:
    fromArtifact:
      stage: app
      path: VERSION
  migrations:
    - name: backfill-ledger-2-0
      from: "1.*"               # optional: only when coming from a 1.x
      to:   "2.0.0"             # the boundary this migration crosses
      stage: app                # runs before this stage's pre-actions
      actions:
        - name: backfill
          job:
            sourceRef:
              name: ledger-backfill-job
  stages:
    - name: app
      sourceRef:
        name: my-app
```

When the deployed version crosses from a `1.x` into `2.0.0`, the `backfill` job
runs once, anchored before the `app` stage. The controller tracks progress so a
retry doesn't re-run a completed migration:

- `status.version` — the deployed version, written only after a fully successful
  run.
- `status.pendingMigrations` — migrations the next run will execute.
- `status.executedMigrations` — the in-flight ledger for the current transition.

Migrations emit `MigrationStarted` / `MigrationCompleted` events (and
`MigrationFailed` on error). A downgrade that would skip a required migration is
refused with the `DowngradeRequiresMigration` reason — see its
[runbook](/runbooks/downgraderequiresmigration/).
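
The boundary rule can be sketched in shell under stated assumptions. The helpers `version_lt` and `migration_due` are hypothetical. The sketch reads "crossing" as: the previously deployed version is below `to` and the new version is at or above it. It treats `from` as a shell glob and handles only numeric `MAJOR.MINOR.PATCH` versions. The controller's actual matching rules and its run-once ledger are not modeled.

```shell
# version_lt A B: succeed when A sorts before B (numeric MAJOR.MINOR.PATCH only).
version_lt() (
  IFS=.
  set -- $1 $2   # six fields: A's major, minor, patch, then B's
  [ "$1" -ne "$4" ] && { [ "$1" -lt "$4" ]; exit; }
  [ "$2" -ne "$5" ] && { [ "$2" -lt "$5" ]; exit; }
  [ "$3" -lt "$6" ]
)

# migration_due PREVIOUS TARGET FROM TO: succeed when the step from PREVIOUS to
# TARGET crosses the TO boundary and PREVIOUS matches the optional FROM glob.
migration_due() (
  prev=$1 target=$2 from=$3 to=$4
  version_lt "$prev" "$to" || exit 1    # already at or past the boundary
  version_lt "$target" "$to" && exit 1  # target stays below the boundary
  case "$prev" in
    ${from:-*}) exit 0 ;;               # empty FROM matches any previous version
    *) exit 1 ;;
  esac
)

migration_due 1.4.2 2.0.0 '1.*' 2.0.0 && echo 'run backfill-ledger-2-0'  # prints run backfill-ledger-2-0
migration_due 2.0.0 2.1.0 '1.*' 2.0.0 || echo 'nothing to run'           # prints nothing to run
```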


---

# CLI

Source: https://stageset.projects.metio.wtf/cli/


`stagesetctl` previews, renders, and drives StageSets without waiting for the next
reconcile. It speaks to the cluster with your own kubeconfig — nothing about it runs
in-cluster.

Installed on your `PATH` as `kubectl-stageset`, it also works as a kubectl plugin:
`kubectl stageset <command>` is equivalent to `stagesetctl <command>`.

| Command | Purpose |
|---|---|
| [`get`](/cli/get/) | Print a StageSet's status, or list StageSets. |
| [`build`](/cli/build/) | Render a StageSet's manifests to stdout. |
| [`diff`](/cli/diff/) | Preview what a reconcile would change; usable as a CI gate. |
| [`reconcile`](/cli/reconcile/) | Force an out-of-band reconcile. |

## Global flags

Every command accepts the standard kubectl connection flags
(`genericclioptions.ConfigFlags`): `--kubeconfig`, `--context`, `-n/--namespace`,
`--as`, `--as-group`, `--server`, `--token`, `--request-timeout`, and the rest.
`--version` prints the binary version and commit (`<version> (commit <commit>)`).
With no `-n/--namespace`, the command uses the namespace from your current
kubeconfig context, falling back to `default`.

## Exit codes

Every command shares the same baseline:

| Code | Meaning |
|---|---|
| `0` | Success. |
| `2` | Usage or flag error. |
| `3` | Runtime error. |

[`diff`](/cli/diff/) adds one more: it exits `1` when it finds changes (the
`diff(1)` convention), so it can gate a CI pipeline.
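
A wrapper script can branch on these codes. The sketch below maps each code from the table to a short verdict; the function name `describe_exit` and the verdict strings are made up for illustration.

```shell
# describe_exit CODE: translate a stagesetctl exit status into a short verdict.
describe_exit() {
  case "$1" in
    0) echo "ok" ;;
    1) echo "changes-pending" ;;   # only `diff` uses this code
    2) echo "usage-error" ;;
    3) echo "runtime-error" ;;
    *) echo "unknown" ;;
  esac
}

describe_exit 1   # prints changes-pending
```

In a pipeline step it would follow the real command, for example `stagesetctl diff payments; describe_exit "$?"`, so that pending changes and genuine errors can be reported differently.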


---

# stagesetctl build

Source: https://stageset.projects.metio.wtf/cli/build/


Runs the same resolve → fetch → build pipeline the controller uses and writes the
result — a multi-document YAML stream — to stdout. This is what would be applied,
before it is applied. To preview the change against live cluster state instead, use
[`diff`](/cli/diff/).

```text
stagesetctl build NAME [flags]
```

| Flag | Default | Description |
|---|---|---|
| `--stage` | _(all)_ | Render only the named stage(s); repeatable. |
| `--source-dir` | _(none)_ | Use a local artifact tree as `[STAGE=]PATH` instead of fetching from the cluster; repeatable. |
| `--show-secrets` | `false` | Reveal Secret values instead of masking them. |
| `--as-tenant` | `false` | Render impersonating the StageSet's `spec.serviceAccountName` (see [multi-cluster and tenancy](/usage/multi-cluster/)). |

Secret values are masked by default, so the output is safe to paste into a review.
`build` writes YAML unconditionally — there is no output-format flag.

## Example

```shell
stagesetctl build payments --stage application
```

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: payments
spec:
  replicas: 6
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: web
          image: registry.internal/web:2.1.0
---
apiVersion: v1
kind: Secret
metadata:
  name: web-config
  namespace: payments
type: Opaque
data:
  token: '***'          # masked; pass --show-secrets to reveal
```

`--source-dir` makes `build` work offline — point it at the directory an artifact
would have unpacked to and it skips the cluster fetch, for authoring and CI. The
value is `[STAGE=]PATH`: prefix a stage name to target one stage, or give a bare
path to feed every stage that has no entry of its own. Repeat the flag to map
each stage to its own tree:

```shell
# one stage from a local tree
stagesetctl build payments --stage application --source-dir application=./out

# every stage from one tree (bare path), overriding just infrastructure
stagesetctl build payments \
  --source-dir ./checkout \
  --source-dir infrastructure=./infra-checkout
```
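
The `[STAGE=]PATH` precedence can be illustrated with a small resolver. This is a sketch of the rule as described above, not the CLI's parser: `resolve_source_dir` is a hypothetical helper, and it assumes that a bare path never contains `=`.

```shell
# resolve_source_dir STAGE VALUE...: print the local tree STAGE would use, given
# the raw --source-dir values. Fails when no value covers the stage, which is
# the case where the artifact would be fetched from the cluster instead.
resolve_source_dir() (
  stage=$1; shift
  fallback=
  for value in "$@"; do
    case "$value" in
      "$stage"=*) printf '%s\n' "${value#*=}"; exit 0 ;;  # entry for this stage wins
      *=*) ;;                                             # entry for another stage
      *) fallback=$value ;;                               # bare PATH, used as fallback
    esac
  done
  [ -n "$fallback" ] || exit 1
  printf '%s\n' "$fallback"
)

resolve_source_dir infrastructure ./checkout infrastructure=./infra-checkout  # prints ./infra-checkout
resolve_source_dir application ./checkout infrastructure=./infra-checkout     # prints ./checkout
```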


---

# stagesetctl diff

Source: https://stageset.projects.metio.wtf/cli/diff/


By default `diff` performs a
[server-side](https://kubernetes.io/docs/reference/using-api/server-side-apply/)
dry-run apply and exits `1` when there are changes, so it works as a CI gate. It
shows, per object, what a reconcile would create, configure, or delete, plus the
[actions](/usage/actions/) a rollout would run. To see the full rendered manifests
without comparing against the cluster, use [`build`](/cli/build/).

```text
stagesetctl diff NAME [flags]
```

| Flag | Default | Description |
|---|---|---|
| `--stage` | _(all)_ | Diff only the named stage(s); repeatable. |
| `--source-dir` | _(none)_ | Use a local artifact tree as `[STAGE=]PATH`; repeatable. Skips the cluster fetch. |
| `--server-side` | `true` | Server-side dry-run apply diff (needs update/patch RBAC). `false` renders client-side against live objects. |
| `--as-tenant` | `false` | Render and dry-run impersonating `spec.serviceAccountName` (see [multi-cluster and tenancy](/usage/multi-cluster/)). |
| `--show-secrets` | `false` | Reveal Secret values instead of masking. |
| `--show-unchanged` | `false` | Include objects with no change. |
| `--prune` | `true` | Show resources that would be deleted (fell out of inventory). |
| `--color` | `auto` | Colorize output: `auto`, `always`, or `never`. |
| `--exit-code` | `true` | Exit `1` when changes are found. With `false`, a run that completes without error exits `0` even when changes are found. |

## Example

```shell
stagesetctl diff payments
```

```text
--- live
+++ merged
@@ Deployment payments/web @@
 spec:
-  replicas: 3
+  replicas: 6

- ConfigMap payments/old-feature-flags (pruned: fell out of inventory)

Actions to run:
  application:
    pre   db-migrate   job ledger-migrations
    post  smoke-test   http https://payments.internal/healthz
```

Objects that left the stage's [inventory](/api/stageinventory/) show as deletions
(`pruned: …`); pass `--prune=false` to hide them. The trailing `Actions to run`
block lists the [pre/post/onFailure actions](/usage/actions/) a real reconcile
would execute — `diff` never runs them, it only reports them.

A run that finds no changes prints nothing and exits `0`; pending changes exit `1`.
To inspect pending changes without a failing exit status:

```shell
stagesetctl diff payments --color=never --exit-code=false
```

Use `--server-side=false` when you lack apply RBAC and only need a textual
render-versus-live comparison.


---

# stagesetctl get

Source: https://stageset.projects.metio.wtf/cli/get/


With no `NAME`, lists StageSets in the current namespace. With a `NAME`, prints that
StageSet's detail (Ready reason, per-stage phase, revisions, version) — a readable
view of [`StageSet.status`](/api/stageset/#status).

```text
stagesetctl get [NAME] [flags]
```

| Flag | Default | Description |
|---|---|---|
| `-A`, `--all-namespaces` | `false` | List StageSets across all namespaces. |
| `-o`, `--output` | _(table)_ | Output format: empty for the human table, or `yaml` / `json`. |

## Listing

```shell
stagesetctl get -A
```

```text
NAMESPACE   NAME       READY   REASON       STAGES   VERSION   PENDING
payments    payments   True    Succeeded    2/2      2.1.0     -
platform    platform   True    Succeeded    3/3      -         -
staging     web        False   StageFailed  1/2      -         -
```

`STAGES` is `ready/total`; `PENDING` shows `held until <time>` when an
[update window](/usage/update-windows/) is holding a rollout. When `READY` is
`False`, the `REASON` value identifies the matching [runbook](/runbooks/).
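
For a quick triage of the listing, the rows that are not ready can be filtered with `awk`. This is a convenience sketch only: `not_ready` is a hypothetical helper, it assumes the column order of the `-A` table shown above, and for anything scripted the `-o json` output is the sturdier input.

```shell
# not_ready: read `stagesetctl get -A` table output on stdin and print
# "namespace/name reason" for every row whose READY column is not True.
not_ready() {
  awk 'NR > 1 && $3 != "True" { print $1 "/" $2, $4 }'
}

not_ready <<'EOF'
NAMESPACE   NAME       READY   REASON       STAGES   VERSION   PENDING
payments    payments   True    Succeeded    2/2      2.1.0     -
staging     web        False   StageFailed  1/2      -         -
EOF
# prints staging/web StageFailed
```

Against a live cluster the same filter would be fed by `stagesetctl get -A | not_ready`.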

## Detail

```shell
stagesetctl get payments -n payments
```

```text
Name:       payments
Namespace:  payments
Ready:      True (Succeeded)
Message:    All 2 stages applied
Version:    2.1.0
Last handled reconcile: 2026-06-15T09:21:04Z
Stages:
  NAME            PHASE   REVISION        ENTRIES
  infrastructure  Ready   sha256:9f3c1a   12
  application     Ready   sha256:1a2b3c   8
```

Conditional lines fill in when the StageSet is in that state: `Suspended: true`
when [`spec.suspend`](/api/stageset/#scheduling) is set, `Pending migrations:`
when a [version boundary](/usage/versioned-migrations/) is queued, and a
`Pending update:` block (next-window time plus the held revisions) when an
[update window](/usage/update-windows/) is holding a rollout — for example:

```text
Ready:      False (UpdateDeferred)
Pending update:
  Next window opens: 2026-06-16T08:00:00Z
  Held: payments/payments-app -> sha256:cafe
```

Add `-o yaml` (or `-o json`) to print the full object instead of the summary — the
machine-readable form for scripting or piping into `jq`/`yq`.


---

# stagesetctl reconcile

Source: https://stageset.projects.metio.wtf/cli/reconcile/


Stamps the `reconcile.fluxcd.io/requestedAt`
[annotation](/api/stageinventory/#well-known-labels-and-annotations) to trigger a
reconcile now, optionally waiting for the controller to report it handled.

```text
stagesetctl reconcile NAME [flags]
```

| Flag | Default | Description |
|---|---|---|
| `--stage` | _(all)_ | Force only this stage to re-run its actions (single-stage reconcile). |
| `--with-source` | `false` | Also re-request the stage sources before reconciling. |
| `--update-now` | `false` | Apply a window-held rollout immediately, bypassing update windows. |
| `--force` | `false` | Proceed even when the StageSet is suspended. |
| `--wait` | `false` | Block until the controller reports the request handled. |
| `--timeout` | `5m` | How long to wait with `--wait`. |

## Example

```shell
stagesetctl reconcile payments -n payments
```

```text
Reconcile requested for StageSet payments (token 2026-06-15T09:30:00Z)
```

Force just one stage to re-run its actions:

```shell
stagesetctl reconcile payments --stage application
```

```text
Reconcile requested for stage "application" of StageSet payments (token 2026-06-15T09:31:12Z)
```

Re-pull sources, push a window-held rollout through, and wait for it:

```shell
stagesetctl reconcile payments --with-source --update-now --wait --timeout 10m
```

`--update-now` is the CLI equivalent of the `stages.metio.wtf/update-now`
annotation — the supported escape hatch when an [update
window](/usage/update-windows/) is holding a rollout you need to ship now.


---

# API reference

Source: https://stageset.projects.metio.wtf/api/


The controller owns two custom resources in the `stages.metio.wtf` API group, both served at version `v1`.
Every field is described with its type, whether it is required, its default, and
what it does.

- [`StageSet`](/api/stageset/) — the resource you author: an ordered set of
  stages, with scheduling, security, gating, versioning, and rollback.
- [`StageInventory`](/api/stageinventory/) — controller-managed; records the
  objects each stage applied so the controller can prune precisely.


---

# StageInventory

Source: https://stageset.projects.metio.wtf/api/stageinventory/


```yaml
apiVersion: stages.metio.wtf/v1
kind: StageInventory
```

A `StageInventory` records the set of objects a single stage has applied, so the
controller can prune precisely and tear stages down in reverse order. **You do not
author these** — the controller creates, updates, and deletes them. The fields
below help you read inventory state when debugging and decide what to back up.

One stage may be backed by several `StageInventory` shards once it exceeds
`--inventory-shard-cap` entries (default 5000). Shard `0` doubles as the ApplySet
([KEP-3659](https://github.com/kubernetes/enhancements/tree/master/keps/sig-cli/3659-kubectl-applyset))
parent object for the stage.
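
To estimate which shard objects to expect for a stage, the arithmetic can be sketched as follows. The helper `shard_names` is hypothetical. It assumes that shards fill sequentially up to the cap, that shard `0` always exists, and that names follow the `<stageset>-<stage>-<shard>` pattern used in the example manifest on this page.

```shell
# shard_names STAGESET STAGE ENTRIES [CAP]: print the expected StageInventory
# names, one per line. CAP defaults to 5000, the documented default cap.
shard_names() (
  stageset=$1; stage=$2; entries=$3; cap=${4:-5000}
  shards=$(( (entries + cap - 1) / cap ))   # ceiling division
  [ "$shards" -ge 1 ] || shards=1           # assume shard 0 is always present
  i=0
  while [ "$i" -lt "$shards" ]; do
    printf '%s-%s-%s\n' "$stageset" "$stage" "$i"
    i=$((i + 1))
  done
)

shard_names payments application 12   # prints payments-application-0
```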

## spec

A `StageInventory` as the controller writes it (read-only — never hand-author one):

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageInventory
metadata:
  name: payments-application-0          # <stageset>-<stage>-<shard>
  namespace: payments
  labels:
    stages.metio.wtf/stage-set: payments
    stages.metio.wtf/stage: application
    stages.metio.wtf/shard: "0"
spec:
  stagePosition: 1                       # the stage's index in spec.stages
  entries:                               # identifiers only — never object contents
    - id: payments_web_apps_Deployment   # namespace_name_group_kind
      v: apps/v1                         # the applied API version
    - id: payments_web__Service          # empty group → core/v1
      v: v1
```

The inventory is stored in `spec` (not `status`) on purpose: backup tooling that
restores `spec` preserves the prune history, so a restored controller does not
orphan or wrongly prune objects.

| Field | Meaning |
|---|---|
| `stagePosition` | The stage's index in `spec.stages` at write time. Teardown walks inventories in reverse position order. |
| `entries[].id` | An applied object's identifier, form `namespace_name_group_kind` (empty group for core). |
| `entries[].v` | The API version the object was applied at. |
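
When reading an inventory by hand, an entry id can be split back into its parts. A minimal sketch, assuming no field contains an underscore (true for DNS-style names, not guaranteed for every resource type); `parse_entry_id` is a hypothetical helper.

```shell
# parse_entry_id ID: split namespace_name_group_kind into labelled fields.
# An empty group is printed as "core".
parse_entry_id() {
  printf '%s\n' "$1" | {
    IFS=_ read -r ns name group kind
    printf 'namespace=%s name=%s group=%s kind=%s\n' "$ns" "$name" "${group:-core}" "$kind"
  }
}

parse_entry_id payments_web_apps_Deployment   # prints namespace=payments name=web group=apps kind=Deployment
parse_entry_id payments_web__Service          # prints namespace=payments name=web group=core kind=Service
```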

## Well-known labels and annotations

The controller stamps these onto inventories and managed objects:

| Key | On | Meaning |
|---|---|---|
| `stages.metio.wtf/stage-set` | inventory | Owning StageSet name. |
| `stages.metio.wtf/stage` | inventory | Stage name. |
| `stages.metio.wtf/shard` | inventory | Shard index. |
| `stages.metio.wtf/prune` | managed object | Set to `disabled` to opt an object out of pruning. |

The controller also honors the following annotations on a StageSet or its objects;
each has a [`stagesetctl reconcile`](/cli/reconcile/) equivalent:

- `reconcile.fluxcd.io/requestedAt` — request an out-of-band reconcile.
- `stages.metio.wtf/reconcile-stage` — force a single stage to re-run.
- `stages.metio.wtf/update-now` — push a window-held rollout through immediately.
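
For the first annotation, the manual equivalent of `stagesetctl reconcile` is a plain `kubectl annotate`. The sketch below only prints the command rather than running it. Both `stageset` as the kubectl resource name and the timestamp token format (taken from the CLI output shown in this documentation) are assumptions, and `reconcile_request_cmd` is a hypothetical helper.

```shell
# reconcile_request_cmd NAMESPACE NAME TOKEN: print (not run) a kubectl command
# that stamps the requestedAt annotation with TOKEN.
reconcile_request_cmd() {
  printf 'kubectl -n %s annotate --overwrite stageset/%s reconcile.fluxcd.io/requestedAt=%s\n' \
    "$1" "$2" "$3"
}

# Use the current UTC time as the token.
reconcile_request_cmd payments payments "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```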

## Inventory modes

`--inventory-mode` selects how applied state is tracked:

- `entries` — identifiers recorded in `StageInventory` only.
- `hybrid` (default) — identifiers plus ApplySet labels for tooling
  compatibility.
- `applyset` — ApplySet-native.

The mode satisfied by the stored inventory is surfaced on
`StageSet.status.inventoryMode`.


---

# StageSet

Source: https://stageset.projects.metio.wtf/api/stageset/


```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
```

A `StageSet` is a namespaced [Kubernetes](https://kubernetes.io/docs/) resource
describing an ordered set of stages. Only `spec.stages` is required; everything else
refines scheduling, security, gating, versioning, and rollback. Every field below is
shown in YAML at least once.

The smallest valid StageSet:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: my-app
  namespace: default
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app
```

---

## Scheduling

```yaml
spec:
  interval: 5m                  # optional: reconcile cadence (default: --default-interval)
  retryInterval: 1m             # cadence after a failed run (default: interval)
  driftDetectionInterval: 2m    # faster drift correction than interval (optional)
  timeout: 5m                   # default per-stage timeout (optional)
  suspend: false                # pause reconciliation without deleting (default false)
```

- **`interval`** (optional) — steady-state reconcile cadence; each reconcile
  re-resolves sources, re-asserts desired state (correcting drift), and prunes.
  **When omitted, the controller's `--default-interval` is used** (the chart's
  `controller.defaultInterval`, default `10m`), so most StageSets can leave it out.
- **`retryInterval`** — retry cadence after a failure; falls back to `interval`.
- **`driftDetectionInterval`** — a shorter cadence dedicated to healing out-of-band
  drift when you need it tighter than `interval`.
- **`timeout`** — how long any one stage may take before it fails; override per
  stage with `stages[].timeout`.
- **`suspend`** — short-circuits to `Ready=False / Suspended`, leaving applied state
  running. Use [`stagesetctl reconcile --force`](/cli/reconcile/) to run once while
  suspended. See the [`Suspended` runbook](/runbooks/suspended/).
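
The fallback chain for the two cadences can be written out explicitly. This is an illustration of the defaults listed above, not controller code; `effective_intervals` is a hypothetical helper, and its optional third argument stands in for the controller's `--default-interval`.

```shell
# effective_intervals INTERVAL RETRY_INTERVAL [DEFAULT]: print the cadences in
# effect. Pass an empty string for a field that is unset in the spec.
effective_intervals() (
  default=${3:-10m}            # documented default of --default-interval
  interval=${1:-$default}      # spec.interval, else the controller default
  retry=${2:-$interval}        # spec.retryInterval, else the interval
  printf 'interval=%s retryInterval=%s\n' "$interval" "$retry"
)

effective_intervals '' ''   # prints interval=10m retryInterval=10m
effective_intervals 5m ''   # prints interval=5m retryInterval=5m
effective_intervals 5m 1m   # prints interval=5m retryInterval=1m
```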

## Ordering between StageSets

`dependsOn` gates this StageSet on others being Ready at their observed generation
— cross-release ordering. (Ordering *within* a StageSet is the order of `stages`.)

```yaml
spec:
  dependsOn:
    - name: platform
      namespace: platform-system
```

## Security and targeting

```yaml
spec:
  serviceAccountName: payments-deployer   # impersonated for every cluster operation
  kubeConfig:
    secretRef:
      name: prod-eu-kubeconfig            # apply to a remote cluster
  decryption:
    provider: sops                        # decrypt SOPS files in stage sources
    secretRef:
      name: sops-age                      # holds an age key under *.agekey
```

- **`serviceAccountName`** — the ServiceAccount the controller impersonates; the
  StageSet can do exactly what its RBAC allows. See
  [multi-cluster and tenancy](/usage/multi-cluster/).
- **`kubeConfig.secretRef`** — a Secret holding a kubeconfig for a remote cluster.
  Only `secretRef` is accepted.
- **`decryption`** — decrypt SOPS-encrypted files (`age`) in every stage's source
  before they are built. `provider` is `sops`; `secretRef` names the key Secret,
  read under `serviceAccountName`. See [secrets encryption](/usage/encryption/).

## Versioning and migrations

Versioning is off unless `spec.version` is set. Set **exactly one** of
`value` / `fromObject` / `fromArtifact`:

```yaml
spec:
  version:
    # fromObject reads the version from a rendered object — by default the
    # app.kubernetes.io/version label, so it travels in the manifests (works for
    # every source kind, including JaaS). The recommended default.
    fromObject:
      stage: app
      kind: Deployment
      name: web
      # apiVersion: apps/v1            # optional; narrows an ambiguous Kind+Name
      # fieldPath: "{.data.version}"   # optional JSONPath; defaults to the version label
    # value: "2.1.0"                   # …or pin it inline
    # fromArtifact: { stage: app, path: VERSION }   # …or read a VERSION file (Git/OCI/Bucket)

  migrations:
    - name: backfill-ledger-2-0 # idempotency-ledger / Events name
      from: "1.*"               # optional: constrain the version it applies from
      to:   "2.0.0"             # required: the boundary this migration crosses
      stage: app                # runs before this stage's pre-actions
      actions:                  # the same Action shape used by stages (see below)
        - name: backfill
          job:
            sourceRef:
              name: ledger-backfill-job
```

See [versioned migrations](/usage/versioned-migrations/).

## Rollback

```yaml
spec:
  rollbackOnFailure: true       # restore last-good revisions on a failed run
```

Needs a rollback store configured; see [rollback](/usage/rollback/).

## Update windows

Gate *when* new revisions roll out. Each window is `Allow` or `Deny`, recurring
(cron) **or** absolute (from/to). `windowScope` controls how strict a closed window
is.

```yaml
spec:
  windowScope: Updates          # Updates (default): hold rollouts, keep correcting
                                # drift. All: a hard freeze — no applies at all.
  updateWindows:
    - type: Deny                # Deny always wins over Allow
      schedule: "0 9 * * MON-FRI"   # 5-field cron: window start
      duration: 8h
      timeZone: Europe/Berlin   # IANA tz (default UTC)
    - type: Deny                # an absolute one-off freeze
      from: 2026-12-24T00:00:00Z
      to:   2026-12-27T00:00:00Z
```

A recurring window uses `schedule` + `duration`; an absolute window uses
`from` + `to`. See [update windows](/usage/update-windows/).

---

## Stages

`stages` (required, min 1) is the ordered list. A stage with every field set:

```yaml
spec:
  stages:
    - name: app                 # required; DNS-label, unique in the StageSet
      sourceRef:
        name: my-app            # required
        kind: ExternalArtifact  # default; also GitRepository/OCIRepository/Bucket
                                # directly, or a producer (e.g. JsonnetSnippet)
        apiVersion: source.toolkit.fluxcd.io/v1   # required for a producer kind
        namespace: other-ns     # default: the StageSet's namespace
      path: ./overlays/prod     # path inside the artifact (default ./)
      prune: true               # GC objects that leave the stage (default true)
      timeout: 3m               # per-stage timeout (default: spec.timeout)
      force: false              # sugar for conflictPolicy.default: Recreate
      applyHelmHookResources: true  # apply helm.sh/hook objects as ordinary ones
      patches: []               # Kustomize patches applied after build
      conflictPolicy: {}        # see below
      postBuild: {}             # see below
      actions: {}               # see below
      readyChecks: {}           # see below
```

`sourceRef.kind` defaults to `ExternalArtifact`, so the common case is just
`sourceRef: { name: … }`. A `sourceRef` resolves to a [Flux](https://fluxcd.io/)
artifact in one of three ways: an `ExternalArtifact`
([RFC-0012](https://github.com/fluxcd/flux2/tree/main/rfcs), the default), a classic
Flux source — `GitRepository`, `OCIRepository`, or `Bucket` — consumed **directly**,
or any other kind treated as a *producer* and resolved to its `ExternalArtifact` via
the back-pointer index. See
[stages and sources](/usage/stages-and-sources/#source-kinds) and
[producer-aware sources](/usage/producer-aware-sources/).

### patches

[Kustomize](https://kubectl.docs.kubernetes.io/) strategic-merge or JSON6902
patches, applied after the build:

```yaml
      patches:
        - patch: |
            - op: replace
              path: /spec/replicas
              value: 6
          target:
            kind: Deployment
            name: web
```

### postBuild

Variable substitution after build and patching:

```yaml
      postBuild:
        substitute:
          cluster_name: prod-eu        # inline key/value
        substituteFrom:
          - kind: ConfigMap            # required: ConfigMap or Secret
            name: cluster-vars
          - kind: Secret
            name: cluster-secrets
            optional: true             # tolerate a missing source
```

### conflictPolicy

Per-resource answers to apply conflicts (immutable fields, ownership):

```yaml
      conflictPolicy:
        default: Fail                  # Fail (default) | Recreate | KeepExisting
        rules:
          - target:                    # partial selector; unset fields match all
              apiVersion: batch/v1
              kind: Job
            action: Recreate
          - target:
              kind: PersistentVolumeClaim
              name: scratch
            action: Recreate
            allowDataLoss: true        # required to Recreate a PVC/PV
```

See [conflict policies](/usage/conflict-policies/).
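
How a partial selector picks an action can be sketched in shell. Two things here are assumptions rather than documented behavior: that the first matching rule wins, and the `apiVersion|kind|name|action` string encoding, which exists only for this example. The `allowDataLoss` guard is not modeled, and `pick_action` is a hypothetical helper.

```shell
# pick_action DEFAULT API_VERSION KIND NAME RULE...: print the conflict action
# for one object. Each RULE is "apiVersion|kind|name|action"; an empty selector
# field matches everything.
pick_action() (
  default=$1
  o_api=$2 o_kind=$3 o_name=$4
  shift 4
  for rule in "$@"; do
    # Split the rule into its three selector fields and the action.
    r_api=${rule%%|*};  rest=${rule#*|}
    r_kind=${rest%%|*}; rest=${rest#*|}
    r_name=${rest%%|*}; action=${rest#*|}
    [ -z "$r_api" ]  || [ "$r_api" = "$o_api" ]   || continue
    [ -z "$r_kind" ] || [ "$r_kind" = "$o_kind" ] || continue
    [ -z "$r_name" ] || [ "$r_name" = "$o_name" ] || continue
    printf '%s\n' "$action"; exit 0
  done
  printf '%s\n' "$default"   # no rule matched
)

pick_action Fail batch/v1 Job legacy-migration 'batch/v1|Job||Recreate'  # prints Recreate
pick_action Fail apps/v1 Deployment web 'batch/v1|Job||Recreate'         # prints Fail
```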

### readyChecks

Gate when the stage counts as complete:

```yaml
      readyChecks:
        timeout: 5m
        disableWait: false             # true = apply without waiting for readiness
        checks:                        # explicit objects, evaluated with kstatus
          - apiVersion: apiextensions.k8s.io/v1
            kind: CustomResourceDefinition
            name: ledgers.example
        exprs:                         # custom health via CEL expressions (healthCheckExprs shape)
          - apiVersion: db.example/v1
            kind: Database
            current: "status.phase == 'Running'"
            inProgress: "status.phase in ['Pending','Provisioning']"
            failed: "status.phase == 'Failed'"
```

Health expressions use [CEL](https://github.com/google/cel-spec). See
[ready checks](/usage/ready-checks/).

---

## Actions

`stages[].actions` (and `migrations[].actions`) carry typed steps. Each `Action`
has a `name`, optional `timeout`/`retries`, and **exactly one** operation block.

```yaml
      actions:
        pre:        # before apply; failure aborts the stage with nothing applied
          - name: db-migrate
            timeout: 10m
            retries: 2
            job:
              sourceRef: { name: my-app-migrations }
              path: ./jobs
        post:       # after verify; the stage is Ready only if these pass
          - name: smoke-test
            http:
              url: https://my-app.internal/healthz
              method: GET                    # default POST
              expectedStatus: [200]          # default: any 2xx
              headersFrom:
                - name: gate-token
                  key: token
        onFailure:  # best-effort on any failure from apply onward
          - name: page-oncall
            http:
              url: https://alerts.internal/stageset-failed
```

The six operation types — one per Action:

```yaml
# patch — patch an existing object
- name: enable-traffic
  patch:
    target: { apiVersion: v1, kind: Service, name: web }
    type: merge                # merge (default) | json6902
    patch: '{ "spec": { "selector": { "release": "green" } } }'

# http — call an endpoint (hosts gated by --allowed-action-hosts)
- name: approve
  http:
    url: https://gate.internal/approve
    bodyFrom: { name: approve-secret, key: body }

# wait — block for a duration or until a CEL expr holds
- name: settle
  wait:
    duration: 30s
- name: until-available
  wait:
    target: { apiVersion: apps/v1, kind: Deployment, name: web }
    expr: "status.availableReplicas >= 3"
    timeout: 5m

# job — render and await Jobs from an artifact
- name: migrate
  job:
    sourceRef: { name: my-app-migrations }
    path: ./jobs

# delete — remove an existing object (missing = success)
- name: drop-legacy
  delete:
    target: { apiVersion: batch/v1, kind: Job, name: legacy-migration }

# apply — transient, rollout-scoped manifests (NOT inventory-tracked, never pruned)
- name: canary
  apply:
    sourceRef: { name: my-app-canary }
    path: ./
    wait: true                 # block until applied objects report Ready
```

See [actions](/usage/actions/).
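
The `expectedStatus` default on `http` actions (any 2xx unless a list is given) can be expressed as a small check. A sketch only: `status_accepted` is a hypothetical helper that illustrates the rule and does not call the controller.

```shell
# status_accepted CODE [EXPECTED...]: succeed when CODE is acceptable.
# With no EXPECTED values any 2xx passes; otherwise CODE must be listed.
status_accepted() (
  code=$1; shift
  if [ "$#" -eq 0 ]; then
    [ "$code" -ge 200 ] && [ "$code" -lt 300 ]
    exit   # exit with the status of the range check
  fi
  for want in "$@"; do
    [ "$code" = "$want" ] && exit 0
  done
  exit 1
)

status_accepted 204 && echo 'gate passed'      # prints gate passed
status_accepted 204 200 || echo 'gate failed'  # prints gate failed
```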

---

## status

`status` is controller-owned and read-only. A representative snapshot:

```yaml
status:
  observedGeneration: 7
  conditions:
    - type: Ready
      status: "True"
      reason: Succeeded
      message: All 2 stages applied
  lastHandledReconcileAt: "2026-06-15T09:21:04Z"
  lastAttemptedRevisions: { payments/payments-app: sha256:1a2b }
  lastAppliedRevisions:   { payments/payments-app: sha256:1a2b }
  version: "2.1.0"
  pendingMigrations: []
  executedMigrations: []
  inventoryMode: hybrid
  stages:
    - name: infrastructure
      phase: Ready             # Pending|Applying|Pruning|Verifying|Ready|Failed
      appliedRevision: sha256:9f3c
      entriesCount: 12
      shards: 1
      message: ""
      executedActions: []
      ledgerRevision: sha256:9f3c
  lastAppliedSnapshot:
    - stage: infrastructure
      url: http://source-controller.../infra.tar.gz
      digest: sha256:9f3c
  pendingUpdate:               # set only when a window holds a rollout
    revisions: { payments/payments-app: sha256:cafe }
    nextWindowOpens: "2026-06-16T08:00:00Z"
  lastHandledUpdateOverride: "2026-06-15T09:30:00Z"
```

The `Ready` condition's reason is one of the wire-stable values documented in the
[runbooks](/runbooks/).


---

# Comparisons

Source: https://stageset.projects.metio.wtf/comparisons/


`StageSet` isn't a templating tool and isn't a replacement for your manifest
generator. It's a *delivery* controller: it takes manifests that already exist
(as a [Flux](https://fluxcd.io/) `ExternalArtifact`) and rolls them out in order,
with gates, under continuous reconciliation. These pages place it next to tools
people reach for in the same situations.

| | Generates manifests | Applies them | Continuous reconcile / drift | Ordered stages within a release | Gates / typed actions |
|---|---|---|---|---|---|
| **StageSet** | no | yes | yes | **yes** | **yes** |
| Helm | yes (templates) | yes (`helm upgrade`) | no | hooks + weights | hooks only |
| Kustomize (`kustomize` CLI) | yes (overlays) | no (`kubectl apply`) | no | no | no |
| Flux `kustomize-controller` | no | yes | yes | between Kustomizations | health checks |
| Tanka / kubecfg | yes (Jsonnet) | yes (CLI) | no | dependency order | no |

`StageSet` is complementary to all of them. It consumes manifests produced by
[Helm](https://helm.sh/), [Kustomize](https://kustomize.io/),
[Tanka](https://tanka.dev/), or anything else — its job starts once you have
manifests and need to deliver them carefully.

Progressive-delivery controllers ([Flagger](https://flagger.app/),
[Argo Rollouts](https://argoproj.github.io/argo-rollouts/)) sit at another layer —
traffic shifting for a single workload — and also compose with `StageSet` rather
than replace it; see [vs Argo Rollouts](/comparisons/argo-rollouts/).


---

# StageSet vs Argo Rollouts

Source: https://stageset.projects.metio.wtf/comparisons/argo-rollouts/


[Argo Rollouts](https://argoproj.github.io/argo-rollouts/) and `StageSet` are easy
to mention in the same breath because both roll things out gradually, but they
operate at different layers and are complementary rather than competing.

## What Argo Rollouts does

`Argo Rollouts` replaces a `Deployment` with a `Rollout` that shifts traffic to a new
version **progressively** — canary or blue-green — pausing at weighted steps and
promoting based on **metric analysis** (Prometheus queries, web/Job providers).
Its unit of work is a **single workload's** version transition and the traffic in
front of it.

## What StageSet does

`StageSet` orchestrates a **whole release** as an ordered list of stages, each built
from a Flux `ExternalArtifact` — CRDs before the operator that needs them, a
migration before the app, config before the workload — gating each stage on health
and running typed [actions](/usage/actions/) between them. It does not shift
traffic or run metric analysis; its unit of work is the **multi-component release**
and the order things apply in.

| | Argo Rollouts | StageSet |
|---|---|---|
| Unit of work | one workload's version + its traffic | a multi-stage release of artifacts |
| Mechanism | weighted traffic shifting + metric analysis | ordered apply with readiness gates + actions |
| Promotion driver | analysis metrics (Prometheus, web, Job) | stage readiness (kstatus, CEL) and actions |
| Pruning / inventory | no (owns the Rollout's pods) | yes (ApplySet inventory, per-stage prune) |
| GitOps reconcile | via Argo CD / a GitOps tool | native (Flux controller) |

## They compose

A realistic setup uses both: `StageSet` rolls out the release in order, and a
workload *inside* one stage is itself an Argo `Rollout` doing a canary. `StageSet`
gets the supporting pieces (CRDs, config, migrations) in place and healthy;
`Argo Rollouts` handles the fine-grained traffic progression for that one service.

## Integrating them

Both directions are supported:

- **Argo gating on StageSet.** The controller exports a
  `stageset_stage_ready{namespace,stageset,stage}` gauge that Argo's Prometheus
  metric reads directly, and the stage [gate endpoint](/tutorials/progressive-delivery/)
  also answers JSON for Argo's web metric. So an Argo `Rollout` can hold its
  promotion until a `StageSet` stage is Ready — no Job bridge needed.
- **StageSet gating on Argo.** A `StageSet` stage's [ready checks](/usage/ready-checks/)
  can wait (via CEL) on an Argo `Rollout` reaching `Healthy` before the next stage
  runs.

The full, worked examples for both are in the
[progressive-delivery tutorial](/tutorials/progressive-delivery/#argo-rollouts).
While the gate's HTTP-status contract is a native fit for
[Flagger](https://flagger.app/), the readiness gauge and JSON endpoint make
`Argo Rollouts` a first-class consumer too.


---

# StageSet vs Flux kustomize-controller

Source: https://stageset.projects.metio.wtf/comparisons/flux/


This is the closest comparison — `StageSet` is built *for*
[Flux](https://fluxcd.io/) and borrows its conventions. Flux's `kustomize-controller`
(and `helm-controller`) reconcile a source into the cluster continuously, exactly
like `StageSet`. The difference is granularity.

## What kustomize-controller gives you

- Continuous reconciliation of a `Kustomization` from a Flux source, with pruning,
  health checks, drift correction, and `dependsOn` ordering **between**
  Kustomizations.
- Impersonation via `serviceAccountName`, `postBuild` substitution, patches — the
  same surface `StageSet` deliberately mirrors.

## Where StageSet differs

- **Ordering within a release.** `kustomize-controller` applies one Kustomization
  as a unit; ordering exists only *between* Kustomizations via `dependsOn`. To
  sequence three steps you create three Kustomizations and wire their
  dependencies. `StageSet` expresses that as one resource with ordered `stages` —
  and the controller waits for each stage's health before the next.
- **Typed actions between steps.** Migrations, HTTP gates, waits, and transient
  applies are first-class [actions](/usage/actions/); in plain Flux you'd model
  these as extra Kustomizations and Jobs.
- **Release-level features.** [Update windows](/usage/update-windows/),
  [versioned migrations](/usage/versioned-migrations/), and
  [rollback](/usage/rollback/) operate across the whole staged release.
- **Source-native.** A stage consumes a `GitRepository`/`OCIRepository`/`Bucket`
  directly (just like `kustomize-controller`), or an `ExternalArtifact` (RFC-0012),
  or a *producer* resolved to its artifact — which is how it also pairs with
  renderers like [JaaS](https://jaas.projects.metio.wtf/).
- **SOPS parity.** Encrypted Secrets in a source decrypt the same way, via
  [`spec.decryption`](/usage/encryption/) (age, PGP, or cloud KMS), so a SOPS-using
  repo ports across unchanged.

## Using them together

`StageSet` sits alongside the other Flux controllers and reuses Flux's source layer,
notifications (`Alert`/`Provider` targeting `kind: StageSet`), and reconcile
annotations. Use `kustomize-controller` for ordinary one-shot reconciliation and
reach for `StageSet` when a release needs ordered, gated stages.


---

# StageSet vs Helm

Source: https://stageset.projects.metio.wtf/comparisons/helm/


[Helm](https://helm.sh/) is two things: a templating engine (charts) and an
imperative release tool (`helm upgrade`). `StageSet` is neither — it's a declarative
delivery controller. The overlap is ordering: Helm's hooks and hook weights give you
*some* sequencing inside a single chart's install/upgrade.

## What Helm gives you

- Templated, parameterized manifests (charts and values).
- Install/upgrade ordering via `helm.sh/hook` (pre-install, post-upgrade, …) and
  `hook-weight`.
- A release history you can roll back to with `helm rollback`.

## Where StageSet differs

- **Continuous reconciliation.** `helm upgrade` is a point-in-time, imperative
  action; nothing re-asserts the state afterward. `StageSet` reconciles on an
  interval, corrects drift, and prunes — it's GitOps, not a one-shot.
- **Ordering across artifacts, not just within one chart.** Helm hooks order
  resources *inside* a release. `StageSet` orders whole *stages*, each its own
  artifact, with readiness gating between them.
- **Typed gates between steps.** Hooks run Jobs; `StageSet` stages can run Jobs,
  HTTP gates, waits, patches, deletes, and transient applies, as pre/post/onFailure
  [actions](/usage/actions/).
- **Identity.** A `StageSet` applies under an impersonated, per-tenant
  `ServiceAccount`; `helm upgrade` runs as whoever ran it.

## Using them together

Render a chart to manifests (e.g. via a producer that publishes an
`ExternalArtifact`) and deliver it with `StageSet`. The controller understands
`helm.sh/hook` resources: `applyHelmHookResources` (default `true`) applies them as
ordinary objects, so a Helm-style chart's hook resources still get created — now
under `StageSet`'s ordering and gating instead of Helm's.


---

# StageSet vs jsonnet-controller

Source: https://stageset.projects.metio.wtf/comparisons/jsonnet-controller/


[jsonnet-controller](https://github.com/pelotech/jsonnet-controller) (pelotech) is a
Flux controller that evaluates Jsonnet (kubecfg- and Tanka-style) and applies the
result to the cluster. Its `Konfiguration` resource (`jsonnet.io/v1beta1`) is, in
effect, *kustomize-controller for Jsonnet*: point it at a `GitRepository` (or an
HTTP(S) URL), and it builds the Jsonnet and reconciles the manifests — with
pruning, health/revision tracking, TLA string/code variables, and `dependsOn`
ordering **between** Konfigurations.

The two projects sit at **different layers**, which is the whole comparison.

## What jsonnet-controller gives you

- **Jsonnet rendering and applying in one resource.** A `Konfiguration` both
  evaluates the Jsonnet and applies the output — the rendering engine is part of
  the controller. If your goal is "get this Jsonnet/Tanka tree into the cluster,"
  it's a direct, single-resource answer.
- The familiar Flux applier surface: prune, health, interval reconciliation, TLAs,
  and `dependsOn` between Konfigurations.

## Where StageSet differs

- **Renderer-agnostic.** `StageSet` does *not* evaluate Jsonnet. A stage consumes a
  Flux source — `GitRepository`, `OCIRepository`, `Bucket`, or an `ExternalArtifact`
  — so it rolls out plain manifests *or* the output of any renderer. For Jsonnet
  specifically, [JaaS](https://jaas.projects.metio.wtf/) does the evaluation
  (TLAs, ext vars, jb-vendored libraries, [JOI](https://github.com/metio/jsonnet-oci-images)
  images) and publishes an `ExternalArtifact` that `StageSet` consumes. Rendering and
  delivery are separate concerns, owned by separate components.
- **Ordering and gating *within* a release.** A `Konfiguration` applies as one unit;
  sequencing exists only *between* Konfigurations via `dependsOn`. `StageSet` expresses
  a release as ordered [stages](/usage/stages-and-sources/), each waiting on the
  previous stage's health, with typed [actions](/usage/actions/) (migration Jobs,
  HTTP gates, waits) *between* steps.
- **Release-level machinery** jsonnet-controller doesn't carry:
  [update windows](/usage/update-windows/),
  [versioned migrations](/usage/versioned-migrations/),
  [conflict policies](/usage/conflict-policies/), and
  [rollback](/usage/rollback/) across the whole staged release.

## Which to reach for

- You want **Jsonnet rendered and applied as a single unit**, no staging →
  jsonnet-controller is a clean fit (as is JaaS paired with Flux's
  `kustomize-controller`).
- You want **ordered, gated, multi-stage delivery** of manifests — whatever renders
  them → `StageSet`, with `JaaS` supplying the Jsonnet rendering when you need it.

They are not mutually exclusive: jsonnet-controller answers *how Jsonnet becomes
manifests*, `StageSet` answers *how a release is sequenced and gated*. You can pick a
renderer independently — `JaaS` is one option that keeps the rendering reusable
(local `jsonnet`-parity, OCI libraries) and hands `StageSet` an artifact like any
other.

## Using them together

Because a `Konfiguration` is a Kubernetes object, a `StageSet` stage can **apply a
`Konfiguration` and gate on it** — letting jsonnet-controller do the Jsonnet
rendering and applying while `StageSet` sequences it among other stages, with actions
and gates in between.

Put the `Konfiguration` manifest in a source the stage reads (here a
`GitRepository`), then gate the stage on the `Konfiguration` reaching `Ready` for
its current generation:

```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: platform
  namespace: platform
spec:
  stages:
    - name: base                     # render + apply the Jsonnet via jsonnet-controller
      sourceRef:
        kind: GitRepository
        name: platform-konfig        # a repo holding the Konfiguration manifest
      readyChecks:
        exprs:
          - apiVersion: jsonnet.io/v1beta1
            kind: Konfiguration
            # don't proceed until jsonnet-controller has reconciled THIS generation
            current: "status.observedGeneration == metadata.generation && status.conditions.exists(c, c.type == 'Ready' && c.status == 'True')"
            inProgress: "status.conditions.exists(c, c.type == 'Ready' && c.status == 'Unknown')"
            failed: "status.conditions.exists(c, c.type == 'Ready' && c.status == 'False')"
    - name: smoke                     # only runs once the Konfiguration is Ready
      sourceRef:
        kind: GitRepository
        name: platform-smoke
```

`StageSet` applies the `Konfiguration` and waits — the `current` expression holds the
rollout until jsonnet-controller has observed the latest generation *and* reports
`Ready`, so a later stage never starts against a half-rendered base.

### Ownership differs from the JaaS path

This is the important distinction, and it changes who prunes what:

- **Via jsonnet-controller (above).** `StageSet`'s inventory owns **only the
  `Konfiguration` object**. The workloads rendered from the Jsonnet are owned and
  **pruned by jsonnet-controller** — `StageSet` never sees them individually. Delete
  the stage and `StageSet` removes the `Konfiguration`; jsonnet-controller then
  cascades the prune of what it created. You get two nested owners.
- **Via [JaaS](https://jaas.projects.metio.wtf/) (or any source).** `JaaS` only
  *renders* — it publishes an `ExternalArtifact` and owns nothing in the target
  cluster. `StageSet` fetches that artifact and applies the manifests itself, so
  **`StageSet`'s inventory owns every rendered object directly** and prunes them with
  its own ApplySet semantics. One owner, and `StageSet`'s drift correction,
  [conflict policies](/usage/conflict-policies/), and [rollback](/usage/rollback/)
  apply to the resources themselves.

So if you want `StageSet` to be the single owner and pruner of the delivered
resources, render with `JaaS` (or read plain manifests from a source). Reach for the
`Konfiguration`-as-a-stage pattern when you specifically want jsonnet-controller to
keep owning what it renders, and only need `StageSet` to sequence and gate it.


---

# StageSet vs Kustomize

Source: https://stageset.projects.metio.wtf/comparisons/kustomize/


[Kustomize](https://kustomize.io/) (the `kustomize` CLI / `kubectl kustomize`) is a
manifest *builder*: it composes bases and overlays, applies patches, and emits YAML.
It does not apply anything, and it has no notion of ordering, readiness, or
reconciliation — that's `kubectl apply`'s job, and `kubectl` applies everything at
once.

## What Kustomize gives you

- Overlay composition, strategic-merge and JSON6902 patches, variable replacement,
  generators.
- A pure transformation: in goes a kustomization, out come manifests.

## Where StageSet differs

- **It delivers, not just builds.** Kustomize stops at YAML. `StageSet` applies it,
  waits for health, prunes what's gone, and keeps doing so.
- **Ordering and gates.** `kubectl apply -k` has no stages and no gates. `StageSet`
  sequences stages and runs [actions](/usage/actions/) between them.
- **Continuous reconciliation and drift correction**, versus a one-shot `apply`.

## Using them together

`StageSet` *includes* the parts of Kustomize you reach for at delivery time: a stage
has `path`, `patches`, and `postBuild` substitution
([stages and sources](/usage/stages-and-sources/)). So you can keep authoring with
Kustomize overlays and let a stage apply the right overlay, patched and
substituted — then add the ordering, gating, and reconciliation Kustomize alone
doesn't offer.


---

# StageSet vs Tanka and kubecfg

Source: https://stageset.projects.metio.wtf/comparisons/tanka-kubecfg/


[Tanka](https://tanka.dev/) and [kubecfg](https://github.com/kubecfg/kubecfg) are
Jsonnet-based config tools: you express your resources in Jsonnet, the tool renders
them, diffs against the cluster, and applies. They generate configuration and run a
CLI-driven apply, but they are imperative tools you run, not controllers that
reconcile.

## What Tanka / kubecfg give you

- Jsonnet-powered, DRY manifest generation (libraries, abstractions, environments).
- A `diff`/`apply` workflow with dependency-aware ordering of a single apply.

## Where StageSet differs

- **Reconciliation, not invocation.** Tanka/kubecfg apply when *you* run them.
  `StageSet` runs in-cluster and continuously reconciles, corrects drift, and
  prunes.
- **Staged, gated delivery.** They apply a rendered set (in dependency order);
  they don't model multi-stage rollouts with readiness gates, update windows, or
  versioned migrations between stages.
- **GitOps identity and tenancy.** `StageSet` applies under an impersonated tenant
  `ServiceAccount` inside the cluster; Tanka/kubecfg use your local credentials.

## Using them together

The Jsonnet *generation* that Tanka and kubecfg do so well has a GitOps-native
equivalent in two related projects, with `StageSet` as the delivery step:

- **[JOI](https://github.com/metio/jsonnet-oci-images)** ships the Jsonnet
  libraries as OCI images.
- **[JaaS](https://jaas.projects.metio.wtf/)** evaluates the Jsonnet in-cluster
  and publishes the rendered result as an `ExternalArtifact` — the rendering step,
  as a service.
- **`StageSet`** delivers that artifact in ordered, gated stages — the apply step,
  as a controller.

So where you might run `tk apply` or `kubecfg update` from a laptop or CI, this
approach splits the same job into a producer (`JaaS`, importing `JOI` libraries) and
a delivery controller (`StageSet`), both reconciled by Flux. You can also keep using
Tanka/kubecfg to author and publish artifacts, and let `StageSet` handle delivery.


---

# Runbooks

Source: https://stageset.projects.metio.wtf/runbooks/


One page per `status.conditions[Ready].reason` the controller sets, plus a few
operational alert runbooks. Each page covers the symptom, the cause, how to
diagnose it, and how to remediate.

Point the controller at a published copy of these pages with `--runbook-base-url`
(for example `https://stageset.projects.metio.wtf/runbooks`); a link to the page
for the reason is then appended to each actionable Ready message. Healthy reasons
(`Succeeded`, `Suspended`) get no link.
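
As a rough illustration of how such a link could be formed, the sketch below joins the base URL with the lower-cased reason. The helper name `runbook_url` is hypothetical, and the URL shape is an assumption inferred from the page addresses in this documentation (for example `/runbooks/artifactnotfound/`); the controller's exact output may differ.

```shell
# Hypothetical helper: print the runbook link for a Ready reason.
# Assumption: link = <base>/<lower-cased reason>/, matching the page URLs here.
runbook_url() {
  base=${1%/}   # tolerate one trailing slash on the base URL
  reason=$2
  case $reason in
    Succeeded|Suspended) return 0 ;;   # healthy reasons get no link
  esac
  slug=$(printf '%s' "$reason" | tr '[:upper:]' '[:lower:]')
  printf '%s/%s/\n' "$base" "$slug"
}

runbook_url https://stageset.projects.metio.wtf/runbooks ArtifactNotFound
# prints https://stageset.projects.metio.wtf/runbooks/artifactnotfound/
```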


---

# ArtifactNotFound

Source: https://stageset.projects.metio.wtf/runbooks/artifactnotfound/


## Symptom

`READY=False`, `REASON=ArtifactNotFound`. Transient: the controller requeues in case the artifact appears.

## Cause

A stage's `sourceRef` resolves to **no `ExternalArtifact`**. Either:

- a **direct** `sourceRef` (`kind: ExternalArtifact`, the default) names an object that does not exist in the target namespace; or
- a **producer** `sourceRef` (e.g. `kind: JsonnetSnippet`) exists, but no `ExternalArtifact` carries a `spec.sourceRef` back-pointer to it yet — the producer has not created its artifact object.

## Diagnosis

```shell
kubectl describe stageset <name> -n <namespace>     # Message names the missing ref
kubectl get externalartifact -n <namespace>
```

For a producer ref, confirm the producer object exists and that it is configured to publish an `ExternalArtifact` (not only serve over HTTP):

```shell
kubectl get <producer-kind> <name> -n <namespace> -o yaml
```

## Remediation

- Fix a typo in `sourceRef.name` / `sourceRef.namespace`.
- For a direct ref, create (or wait for) the named `ExternalArtifact`.
- For a producer ref, ensure the producer actually publishes an artifact and that it lands in the same namespace as the StageSet (cross-namespace producer refs are gated by `--no-cross-namespace-refs`).

If the artifact exists but is not yet published, the reason is [`SourceNotReady`](/runbooks/sourcenotready/); a spec/API resolution failure is [`ResolveFailed`](/runbooks/resolvefailed/). See [stages and sources](/usage/stages-and-sources/).


---

# Controller pod down

Source: https://stageset.projects.metio.wtf/runbooks/controller-pod-down/


## Symptom

A `stageset-controller` pod is `NotReady`; the `StageSetControllerPodDown` alert
fires. While no replica is Ready, StageSets are not reconciled and the
[Kubernetes](https://kubernetes.io/docs/) admission webhook may reject `StageSet`
writes (`failurePolicy: Fail`).

## Cause

- a crash-looping container (bad config flag, missing RBAC, panic),
- the node draining or out of resources,
- a failing readiness probe (`/readyz` on `--health-probe-bind-address`),
- the leader-election lease unobtainable.

## Diagnosis

```shell
kubectl -n stageset-system get pods -l app.kubernetes.io/name=stageset-controller
kubectl -n stageset-system describe pod <pod>
kubectl -n stageset-system logs <pod> --previous --tail=200
```

Look for flag-parse errors at startup, RBAC `Forbidden` on the controller's own
`ServiceAccount`, or OOMKills.

## Remediation

- Fix the surfaced cause (correct the flag/values, grant the missing controller
  RBAC, raise resource limits).
- Run more than one replica with leader election so a single pod failure doesn't
  stop reconciliation — see [production](/installation/production/#high-availability).
- If admission is blocking writes during the outage and you must unblock urgently,
  scope or relax the webhook `failurePolicy`, then restore it once the controller
  is healthy.

See [operations](/installation/operations/) for the full alert set and its thresholds.


---

# DependencyNotReady

Source: https://stageset.projects.metio.wtf/runbooks/dependencynotready/


## Symptom

`READY=False`, `REASON=DependencyNotReady`. Transient: the controller requeues at `spec.retryInterval` (or `spec.interval`).

## Cause

A StageSet listed in `spec.dependsOn` is not `Ready` at its observed generation, so this StageSet holds before doing any work. Semantics match kustomize-controller: a dependency is satisfied only when its `Ready` condition is `True` **and** its `status.observedGeneration` equals its current generation (so a freshly-edited dependency mid-reconcile does not count as ready).
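
The rule above can be sketched as a small shell predicate. This illustrates the documented semantics only and is not the controller's code; the function name `dependency_satisfied` is hypothetical, and it takes the three values as plain arguments so it runs without a cluster.

```shell
# Hypothetical helper: succeeds only when a dependency counts as satisfied,
# i.e. its Ready condition is True AND observedGeneration == generation.
dependency_satisfied() {
  ready_status=$1         # status of the Ready condition: True, False, or Unknown
  observed_generation=$2  # .status.observedGeneration
  generation=$3           # .metadata.generation
  [ "$ready_status" = "True" ] && [ "$observed_generation" = "$generation" ]
}

# Ready, but the spec was edited after the last reconcile: not satisfied yet.
dependency_satisfied True 4 5 || echo "dependency not ready"
```

Against a live cluster, the three arguments could be read with `kubectl get stageset <dependency> -n <namespace> -o jsonpath=...`, using `{.status.conditions[?(@.type=="Ready")].status}`, `{.status.observedGeneration}`, and `{.metadata.generation}` respectively.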

## Diagnosis

```shell
kubectl describe stageset <name> -n <namespace>            # Message names the dependency
kubectl get stageset <dependency> -n <namespace>           # is it Ready?
kubectl describe stageset <dependency> -n <namespace>      # why not?
```

## Remediation

Resolve the dependency's own Ready condition first (follow its runbook). Once it reports `Ready=True` at its current generation, this StageSet proceeds on the next reconcile. If the dependency is intentionally [suspended](/runbooks/suspended/), this StageSet waits indefinitely by design — remove the `dependsOn` entry or resume the dependency.

A `dependsOn` **cycle** is reported as [`Stalled`](/runbooks/stalled/), not this reason.


---

# DowngradeRequiresMigration

Source: https://stageset.projects.metio.wtf/runbooks/downgraderequiresmigration/


## Symptom

`READY=False`, `REASON=DowngradeRequiresMigration`. Terminal: the run does not requeue until the desired version is at or above `status.version`.

## Cause

The desired version (`spec.version`) is **lower** than the version the controller last recorded as deployed (`status.version`). Downgrades are refused by default: [migrations](/usage/versioned-migrations/) are forward-only action ladders, and replaying upgrade migrations in reverse is how data gets destroyed. The controller does not silently run a downgrade.
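
To check by hand whether a desired version would count as a downgrade, a comparison along these lines may help. It is a simplified sketch: the helper name `is_downgrade` is hypothetical, it compares only the `major.minor.patch` core, and it ignores pre-release ordering, so the controller's own comparison may differ in edge cases.

```shell
# Hypothetical helper: succeeds when desired < deployed (a downgrade).
# Assumption: only major.minor.patch is compared; pre-release and build
# suffixes are stripped before comparing.
is_downgrade() {
  desired=${1%%[-+]*}    # drop anything from the first '-' or '+'
  deployed=${2%%[-+]*}
  printf '%s %s\n' "$desired" "$deployed" | awk '{
    split($1, d, "."); split($2, c, ".")
    for (i = 1; i <= 3; i++) {
      if (d[i] + 0 < c[i] + 0) exit 0   # desired is lower: downgrade
      if (d[i] + 0 > c[i] + 0) exit 1   # desired is higher: upgrade
    }
    exit 1                               # equal versions are not a downgrade
  }'
}

if is_downgrade 2.0.0 2.1.0; then
  echo "desired 2.0.0 is below deployed 2.1.0: downgrade refused"
fi
```

The numeric comparison matters: a plain string comparison would rank `2.10.0` below `2.9.0`.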

## Diagnosis

```shell
kubectl describe stageset <name> -n <namespace>
kubectl get stageset <name> -n <namespace> -o jsonpath='{.status.version}'   # deployed
# desired: read spec.version.value, or the version file the artifact carries
```

## Remediation

Pick the intended direction:

- **You did not mean to downgrade** (e.g. a source revert pulled an older version file): roll the source forward again so the desired version is `>=` the deployed version. The StageSet converges normally.
- **You genuinely need to go back**: a downgrade is an operational decision with potential data loss. Perform it deliberately — restore from backup or apply an explicit down-migration out of band — then set `status.version` to match. There is no automatic reverse-migration path by design.


---

# InvalidSpec

Source: https://stageset.projects.metio.wtf/runbooks/invalidspec/


## Symptom

`READY=False`, `REASON=InvalidSpec`. The Message names the offending field or action. Terminal: the controller does not requeue until the spec changes.

## Cause

The spec failed validation that the CRD schema cannot express cheaply, normally one of:

- an **action sets zero or more than one verb** — each action must set exactly one of `patch`, `http`, `wait`, `job`, `delete`, `apply` (see [actions](/usage/actions/));
- **`spec.migrations` without `spec.version`**, or a migration anchored to a stage name that does not exist (see [versioned migrations](/usage/versioned-migrations/));
- **`spec.version` does not name exactly one source** — set one of `value`, `fromObject`, or `fromArtifact`;
- **`spec.decryption.provider` is not `sops`**, or a `secretRef` is given without a `name` (see [encryption](/usage/encryption/));
- an **invalid update window** — a malformed `schedule`, `duration`, or `timeZone` (see [update windows](/usage/update-windows/)).

The admission webhook normally rejects these at write time; seeing this on the object means the webhook was bypassed or disabled and the reconciler caught it.

## Diagnosis

```shell
kubectl describe stageset <name> -n <namespace>
```

Read the Message — it names the stage, action, or field.

## Remediation

Fix the spec per the Message:

- give each action exactly one verb;
- set `spec.version` (to one of `value`/`fromObject`/`fromArtifact`) whenever `spec.migrations` is present, and anchor each migration to a real stage;
- set exactly one `spec.version` source;
- use `provider: sops` for `spec.decryption`, with a named `secretRef` when one is given;
- correct any malformed update window (`schedule`, `duration`, `timeZone`).

If the webhook should have caught this, confirm the `ValidatingWebhookConfiguration` is installed and its service is reachable.


---

# InvalidVersion

Source: https://stageset.projects.metio.wtf/runbooks/invalidversion/


## Symptom

`READY=False`, `REASON=InvalidVersion`. Terminal: the run does not requeue until the spec or the version file is fixed.

## Cause

The version in `spec.version` (or a migration boundary) could not be resolved to a parseable [semver](https://semver.org/). The controller refuses to proceed rather than deploy a half-versioned system: a system whose recorded version is unknown is worse for migrations than an unversioned one. The Message names which input failed. By version source:

- **`spec.version.value`** — the inline string is not a semver.
- **`spec.version.fromObject`** — the named stage doesn't exist; the object (`kind`/`name`) isn't in the stage's rendered manifests; the `fieldPath` is invalid JSONPath or resolves to empty; or the value read (by default the `app.kubernetes.io/version` label) is missing or not a semver.
- **`spec.version.fromArtifact`** — the named stage doesn't exist; the file at `path` is missing from the stage's artifact, empty, or doesn't parse as a semver.
- **`spec.version` sets none** of `value`/`fromObject`/`fromArtifact`.
- A **migration's `to` or `from`** is not a valid semver.
- The recorded **`status.version`** is not a semver (corrupted status).

Common triggers across all of them: a `v` prefix or trailing whitespace the parser rejects, or non-semver text (e.g. a Git SHA or a `latest` tag) where a version was expected.
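
A quick local check for these triggers could look like the sketch below. The helper name `is_bare_semver` is hypothetical and the pattern is a simplified approximation of the semver grammar, not the controller's parser; treat a pass as a hint rather than a guarantee.

```shell
# Hypothetical helper: succeeds when the argument looks like a bare semver.
# Assumption: simplified pattern (core version, optional pre-release and
# build metadata); the controller's parser may be stricter or looser.
is_bare_semver() {
  printf '%s' "$1" | grep -Eq \
    '^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$'
}

for candidate in 2.1.0 v2.1.0 "2.1.0 " latest; do
  if is_bare_semver "$candidate"; then
    echo "ok:       '$candidate'"
  else
    echo "rejected: '$candidate'"
  fi
done
```

Only `2.1.0` passes here; the `v` prefix, the trailing space, and the `latest` tag are all rejected.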

## Diagnosis

```shell
kubectl describe stageset <name> -n <namespace>   # Message names the failing input
kubectl get stageset <name> -n <namespace> -o jsonpath='{.spec.version}{"\n"}'
```

Then, depending on which source the Message names:

```shell
# fromObject: confirm the field carries a bare semver on the rendered object
stagesetctl build <name> -n <namespace> --stage <stage> | grep -i version

# fromArtifact: confirm the file exists and contains only a semver (e.g. 2.1.0)
# inspect the resolved artifact for the stage named in the Message
```

## Remediation

Match the failing input in the Message:

- **`value`** — correct the inline string to a bare semver (`2.1.0`, not `v2.1.0`).
- **`fromObject`** — fix the `stage`/`kind`/`name` to point at a real rendered object, fix the `fieldPath`, and ensure the field (default: `app.kubernetes.io/version` label) carries a semver.
- **`fromArtifact`** — fix `path`/`stage` to the real version file, or correct the file to a bare semver.
- **migration `to`/`from`** — correct the boundary to a valid semver.
- If you don't need [versioned migrations](/usage/versioned-migrations/), remove `spec.version` entirely (this disables versioning and migrations).


---

# PreviousRevisionUnavailable

Source: https://stageset.projects.metio.wtf/runbooks/previousrevisionunavailable/


## Symptom

`READY=False`, `REASON=PreviousRevisionUnavailable`. The StageSet has `spec.rollbackOnFailure` set, a run failed, and the controller could not restore the last-good revisions.

## Cause

[`rollbackOnFailure`](/usage/rollback/) restores the previously-applied artifact revisions by re-fetching their recorded URLs and verifying their digests. That only works while the **producer still retains** those revisions. This reason means a revision the rollback needs is no longer fetchable — the producer garbage-collected it.
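
The digest half of that check can be reproduced by hand on a downloaded artifact. The sketch below is illustrative only: the helper name `verify_digest` is hypothetical, it assumes a `sha256:<hex>` digest as shown in `status.lastAppliedSnapshot`, and it relies on the `sha256sum` tool being available.

```shell
# Hypothetical helper: succeeds when the file's SHA-256 matches the
# recorded digest (given as "sha256:<hex>").
verify_digest() {
  file=$1
  expected=${2#sha256:}                          # strip the algorithm prefix
  actual=$(sha256sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Example: a file checked against its own freshly computed digest passes.
tmp=$(mktemp)
printf 'rendered manifests' > "$tmp"
digest="sha256:$(sha256sum "$tmp" | awk '{print $1}')"
verify_digest "$tmp" "$digest" && echo "digest ok"
rm -f "$tmp"
```

A mismatch here means the producer now serves different content under the recorded URL, which is as unusable for rollback as a revision that is gone.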

Rollback is best-effort by contract: it works exactly when producers retain. Common triggers:

- a JaaS `JsonnetSnippet` with `spec.history: 1` (the default) — only the current revision is kept, so there is no previous revision to roll back to
- a stock source-controller source, which retains only the current revision
- the previous revision aged out of the producer's retention window
- the run used [SOPS decryption](/usage/encryption/) and the key Secret was rotated
  or deleted — rollback re-runs decryption rather than restoring plaintext, so it
  fails closed when the key is gone, even for a revision the rollback store holds

## Diagnosis

```shell
kubectl describe stageset <name> -n <namespace>   # Message names the stage + revision
```

Check the producer's retention. For a JaaS snippet:

```shell
kubectl get jsonnetsnippet <name> -n <namespace> -o jsonpath='{.spec.history}'
```

## Remediation

The cluster is left in the partially-applied failed state; resolve the underlying failure (see the failing stage's own runbook) and fix forward. The StageSet converges once the desired revision applies cleanly.

To make rollback reliable in future, either:

- **Increase producer retention** so at least one previous revision is always fetchable — JaaS snippets used with `rollbackOnFailure` should set `spec.history: 2` (or more); sources that retain only the current revision cannot support the re-fetch path, so rely on source revert instead.
- **Configure the external rollback store** — a filesystem/RWX PVC (`--rollback-store-path`) or an S3 bucket (`--rollback-store-s3-*`); see [operations](/installation/operations/). When the controller pushes rendered output to a store it owns, rollback is bit-exact and **independent of producer retention** — this `PreviousRevisionUnavailable` state cannot occur for runs the store holds, unless the run used SOPS and the key Secret is no longer readable (see the SOPS trigger above).


---

# Reconcile latency high

Source: https://stageset.projects.metio.wtf/runbooks/reconcile-latency/


## Symptom

`controller_runtime_reconcile_time_seconds` p99 for `controller="stageset"` exceeds
the configured threshold; the `StageSetReconcileLatencyHigh` alert fires (see
[operations](/installation/operations/) for the alert set and its thresholds).

## Cause

A single reconcile does a lot of work — resolve and fetch every stage's artifact,
kustomize-build, server-side apply, prune, verify readiness, and run actions — all
impersonating the tenant `ServiceAccount`. Latency climbs when any of those is slow:

- large artifacts or slow artifact servers,
- many objects per stage (apply + prune scale with object count),
- readiness waits and `wait`/`http`/`job` actions with long timeouts,
- apiserver or tenant-authorization slowness.

## Diagnosis

```shell
kubectl -n stageset-system logs deploy/stageset-controller --tail=200 | grep -i 'slow\|timeout\|took'
```

Break the latency down by stage count and artifact size; a single StageSet with
many large stages dominates p99.

## Remediation

- Split a very large StageSet into smaller ones, or put fewer objects in each stage.
- Tighten action `timeout`s so a slow gate fails fast instead of stretching the
  reconcile.
- Raise `spec.interval` where freshness isn't critical.
- Address upstream artifact-server or apiserver latency.

If the queue itself is backing up, see [workqueue saturation](/runbooks/workqueue-saturation/).


---

# ResolveFailed

Source: https://stageset.projects.metio.wtf/runbooks/resolvefailed/


## Symptom

`READY=False`, `REASON=ResolveFailed`. The Message describes why resolution failed.

## Cause

A stage's `sourceRef` could not be resolved to an `ExternalArtifact` for a spec/config or API reason (distinct from "not published yet", which is [`SourceNotReady`](/runbooks/sourcenotready/), and "no such object", which is [`ArtifactNotFound`](/runbooks/artifactnotfound/)). Common cases:

- an **ambiguous producer** — more than one `ExternalArtifact` back-points at the same producer object, so the target is undefined;
- a **cross-namespace ref rejected** by `--no-cross-namespace-refs`;
- an **API error** reading the source or artifact (RBAC denial, the artifact CRD not installed).

When the failing `sourceRef` targets another namespace, the Message is deliberately scrubbed to `cross-namespace <kind> %q is not reachable` so tenants cannot fingerprint other namespaces — check that source CR's status in its own namespace.

## Diagnosis

```shell
kubectl describe stageset <name> -n <namespace>
# Ambiguity: are there multiple artifacts pointing at the producer?
kubectl get externalartifact -n <namespace> -o yaml | grep -A3 sourceRef
```

## Remediation

- **Ambiguous producer:** ensure exactly one `ExternalArtifact` back-points at the producer, or reference the `ExternalArtifact` directly by name.
- **Cross-namespace rejected:** move the source into the StageSet's namespace, or run the controller without `--no-cross-namespace-refs` if your [tenancy model](/usage/multi-cluster/) allows it.
- **RBAC / missing CRD:** grant the controller (or the impersonated `serviceAccountName`) read on the source kind, or install the `source-controller` CRDs.


---

# SourceNotReady

Source: https://stageset.projects.metio.wtf/runbooks/sourcenotready/


## Symptom

`READY=False`, `REASON=SourceNotReady`. Transient: the controller requeues and clears the condition once the source publishes.

## Cause

A stage's `sourceRef` resolved to an `ExternalArtifact` (directly, or via a producer's RFC-0012 back-pointer such as a JaaS `JsonnetSnippet`), but that artifact's `status.conditions[Ready]` is not yet `True` — its producer has not finished publishing a revision. The StageSet gates on `Ready=True` so it never builds against a half-written artifact.

## Diagnosis

```shell
# Which artifact, and is it Ready?
kubectl get externalartifact -n <namespace>
kubectl describe externalartifact <name> -n <namespace>

# If the producer is a JsonnetSnippet (or other producer kind), check it:
kubectl describe jsonnetsnippet <name> -n <namespace>
```

## Remediation

This usually clears on its own when the producer publishes. If it persists:

- confirm the producing controller (e.g. the JaaS operator, or [Flux](https://fluxcd.io/) `source-controller`) is running and reconciling the producer object;
- check the producer's own Ready condition for an upstream error (a failed render, an unreachable `GitRepository`/`OCIRepository` source);
- once the producer reports `Ready=True` with a `status.artifact`, the StageSet converges on the next reconcile.

If the artifact never appears at all, the reason is [`ArtifactNotFound`](/runbooks/artifactnotfound/); a spec/API resolution failure is [`ResolveFailed`](/runbooks/resolvefailed/). See [stages and sources](/usage/stages-and-sources/).


---

# StageFailed

Source: https://stageset.projects.metio.wtf/runbooks/stagefailed/


## Symptom

`READY=False`, `REASON=StageFailed`. The Message names the stage and the operation that failed (`fetch artifact`, `build`, `apply`, `verify`, a pre/post action, or `connect to target cluster`). The run halts at that stage; later stages keep their previous revisions.

## Cause

A stage failed during execution. By operation:

- **fetch artifact** — the artifact URL was unreachable, or its bytes failed digest verification.
- **build** — kustomize build or post-build substitution failed (a missing `substituteFrom` source, an invalid patch, a malformed manifest).
- **apply** — the server-side apply was rejected: an immutable-field conflict, or an **RBAC denial** under the impersonated `serviceAccountName`.
- **verify** — applied objects did not become Ready within the stage timeout (kstatus).
- **pre/post action** — a `patch`/`http`/`wait`/`job`/`delete`/`apply` action failed or timed out.
- **connect to target cluster** — a `spec.kubeConfig` Secret was missing, unparseable, or used the unsupported cloud-provider `configMapRef`.

## Diagnosis

```shell
kubectl describe stageset <name> -n <namespace>     # Message: which stage + operation
kubectl -n stageset-system logs deploy/stageset-controller --tail=200

# For apply/verify failures, inspect what the stage tried to apply:
kubectl get stageinventory -n <namespace> \
  -l stages.metio.wtf/stage-set=<name>,stages.metio.wtf/stage=<stage>
```

## Remediation

Match the operation in the Message:

- **fetch / digest** — confirm the producer republished cleanly; a digest mismatch means the artifact changed mid-flight or is corrupt.
- **build** — validate the manifests/patches locally; ensure every `substituteFrom` ConfigMap/Secret exists.
- **apply RBAC** — grant the impersonated `serviceAccountName` (or the controller) the verbs it was denied; the Message names the resource.
- **apply immutable conflict** — set a per-stage [`conflictPolicy`](/usage/conflict-policies/) (or `force: true`, its blunt `Recreate`-everything form) so the controller deletes and recreates the conflicting object; for objects holding data (`PersistentVolumeClaim`/`PersistentVolume`) a `Recreate` rule additionally requires `allowDataLoss: true`. Alternatively, use content-hash-suffixed names so a change is a new object rather than a mutation.
- **verify timeout** — raise the stage `timeout`, or fix why the workload is not becoming Ready.
- **action** — read the action's error; for `http`, confirm the host is in `--allowed-action-hosts`.

Retries re-run the same pinned snapshot idempotently — actions already recorded in the stage's ledger do not re-fire. See [stages and sources](/usage/stages-and-sources/) for how a stage resolves and applies.


---

# Stalled

Source: https://stageset.projects.metio.wtf/runbooks/stalled/


## Symptom

`READY=False`, `REASON=Stalled`. Terminal: the controller does not requeue until the spec changes.

## Cause

A condition that retrying cannot clear. Currently this is a **`spec.dependsOn` cycle** — two or more StageSets depend on each other (directly or transitively), so none can ever become Ready first. The cycle is detected by a breadth-first walk over the `dependsOn` graph. A dependency that is merely not Ready yet (no cycle) reports [`DependencyNotReady`](/runbooks/dependencynotready/) instead.

## Diagnosis

```shell
kubectl describe stageset <name> -n <namespace>     # Message states "spec.dependsOn forms a cycle"
# Trace the edges:
kubectl get stageset -n <namespace> \
  -o 'custom-columns=NAME:.metadata.name,DEPENDSON:.spec.dependsOn[*].name'
```

Follow the `dependsOn` names until you find the loop (A → B → A, or longer).
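
With more than a handful of StageSets, tracing the edges by eye gets tedious. The following is a minimal sketch that hands the edges to `tsort` from GNU coreutils, which exits non-zero when its input contains a loop. It is a diagnostic aid, not the controller's own breadth-first walk. Both function names are hypothetical, `dependson_edges` needs `jq`, and the sketch assumes every `dependsOn` entry points at a StageSet in the same namespace.

```shell
# Emit one "<stageset> <dependency>" pair per line for a namespace.
# Assumes each spec.dependsOn entry has a .name in the same namespace.
dependson_edges() {
  kubectl get stageset -n "$1" -o json |
    jq -r '.items[] | .metadata.name as $n
           | (.spec.dependsOn // [])[] | "\($n) \(.name)"'
}

# Read edge pairs on stdin. Prints "no cycle" and returns 0 for an acyclic
# graph; otherwise prints tsort's diagnostics (which name the StageSets in
# the loop) and returns 1.
check_dependson() {
  # Redirect order matters: stderr goes to the capture, stdout is discarded.
  if diag=$(tsort 2>&1 >/dev/null); then
    echo "no cycle"
  else
    printf '%s\n' "$diag"
    return 1
  fi
}
```

Run it as `dependson_edges <namespace> | check_dependson`. For a loop such as A → B → A, the `tsort` diagnostics list both names.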

## Remediation

Break the cycle by removing one edge — drop a `dependsOn` entry from one StageSet, or restructure so the ordering is a strict chain. Dependencies must form a directed acyclic graph. After the edit, the next reconcile re-walks the graph and clears the condition.


---

# Succeeded

Source: https://stageset.projects.metio.wtf/runbooks/succeeded/


## Symptom

`READY=True`, `REASON=Succeeded`. The Message names the applied revisions.

## Cause

This is the healthy steady state: every stage's artifact resolved, built, applied, and passed its readiness checks, and `status.lastAppliedRevisions` matches `status.lastAttemptedRevisions`.

## Remediation

Nothing to remediate.

- The controller keeps reconciling at `spec.interval`; a re-render upstream (a new `ExternalArtifact` revision) re-applies automatically and the condition stays `Succeeded` once the new revision converges.
- `status.stages[]` reports per-stage `appliedRevision` and inventory entry counts to confirm what each stage owns.


---

# Suspended

Source: https://stageset.projects.metio.wtf/runbooks/suspended/


## Symptom

`READY=False`, `REASON=Suspended`.

## Cause

`spec.suspend: true` is set, so the controller short-circuits before any resolution, build, or apply. This is an intentional operator action, not a failure — applied objects are left exactly as they were at the last successful run.

## Remediation

Resume by clearing the flag:

```shell
kubectl patch stageset <name> -n <namespace> --type=merge -p '{"spec":{"suspend":false}}'
```

The next reconcile picks up from the current artifact revisions.


---

# UpdateDeferred

Source: https://stageset.projects.metio.wtf/runbooks/updatedeferred/


## Symptom

`READY=False`, `REASON=UpdateDeferred` (initial deploy held), or `READY=True` with a message noting a deferral and a populated `status.pendingUpdate` (an already-deployed StageSet with a held update).

## Cause

This is **not a failure** — it is time-based delivery working as configured. A new revision (or the first deploy) is being held because the StageSet's [`spec.updateWindows`](/usage/update-windows/) do not currently permit a rollout: either a `Deny` window is active, or `Allow` windows are declared and none is active right now. With `spec.windowScope: All`, even drift correction is paused while a window is closed.

`status.pendingUpdate` shows the held revisions and `nextWindowOpens` (when delivery resumes); the controller requeues at that boundary.

## Diagnosis

```shell
kubectl get stageset <name> -n <namespace> -o jsonpath='{.status.pendingUpdate}'
kubectl get stageset <name> -n <namespace> -o jsonpath='{.spec.updateWindows}'
```

Compare the current time (in each window's `timeZone`) against the windows. An already-deployed StageSet stays `Ready=True`: the deployed version keeps running while the update waits.
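
To see how long the hold will last, a sketch like the following converts a timestamp into seconds remaining. It assumes `nextWindowOpens` is an RFC 3339 timestamp readable at `.status.pendingUpdate.nextWindowOpens`, and that GNU `date` is available; `seconds_until` is a hypothetical helper.

```shell
# Print the number of seconds from a reference time until a target timestamp.
# $1: target timestamp (for example 2026-06-15T09:00:00Z)
# $2: optional reference time, defaults to the current time
seconds_until() {
  target=$(date -u -d "$1" +%s) || return 1
  now=$(date -u -d "${2:-now}" +%s) || return 1
  echo $(( target - now ))
}

seconds_until 2026-06-15T09:00:00Z 2026-06-15T08:00:00Z   # prints 3600
```

Against a cluster, pass the field in directly: `seconds_until "$(kubectl get stageset <name> -n <namespace> -o jsonpath='{.status.pendingUpdate.nextWindowOpens}')"`. A negative result means the boundary has already passed and the next reconcile should deliver the update.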

## Remediation

Usually none — the update applies automatically when the next window opens. If you need it sooner:

- **Force it through once** (e.g. an emergency fix during a freeze):

  ```shell
  kubectl annotate --overwrite stageset <name> -n <namespace> \
    stages.metio.wtf/update-now="$(date +%s)"
  ```

  This applies the held rollout immediately, regardless of windows (one-shot per annotation value).
- **Adjust the windows** if the schedule is wrong — check `type` (Allow vs Deny), the cron `schedule`/`duration` or absolute `from`/`to`, and especially the `timeZone`.


---

# Webhook cert renewal failing

Source: https://stageset.projects.metio.wtf/runbooks/webhook-cert-renewal/


## Symptom

`stageset_webhook_cert_renewal_failures_total` is increasing; the
`StageSetWebhookCertRenewalFailing` alert fires (see
[operations](/installation/operations/) for the alert set and its thresholds).
The current certificate keeps working until its natural expiry — that expiry is
the deadline, after which cluster-wide `StageSet` admission breaks.

## Cause

This applies only with `--webhook-cert-mode=self-signed`. The in-pod renewer regenerates
the serving cert every `validity/3` and patches the
`ValidatingWebhookConfiguration`'s `caBundle`. It fails when:

- the controller lost `update` (or `get`) on the named
  `ValidatingWebhookConfiguration` (`--webhook-validating-config-name`),
- the VWC was renamed and the flag/`resourceNames` weren't updated,
- the cert directory (`--webhook-cert-dir`) became read-only.

In `cert-manager` mode this metric is irrelevant — [cert-manager](https://cert-manager.io/) owns renewal.

## Diagnosis

```shell
kubectl -n stageset-system logs deploy/stageset-controller | grep -iE 'cert|renew|caBundle'
kubectl get validatingwebhookconfiguration <name> -o jsonpath='{.webhooks[*].clientConfig.caBundle}' | head -c 40
```
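
Because the certificate's expiry is the deadline, it helps to read that date directly. This is a minimal sketch, assuming `openssl` is installed and that the first webhook's `caBundle` holds a single PEM certificate. It reports the expiry of the CA bundle, which matches the serving certificate's expiry only if both are issued with the same lifetime; that is an assumption, not something the source states. `cabundle_enddate` is a hypothetical helper.

```shell
# Read a base64-encoded PEM certificate on stdin and print its notAfter date.
cabundle_enddate() {
  base64 -d | openssl x509 -noout -enddate
}
```

Feed it the bundle with `kubectl get validatingwebhookconfiguration <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | cabundle_enddate`. The output is a single `notAfter=...` line.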

## Remediation

- Restore `get`/`update` on the named VWC in the controller's ClusterRole
  (`resourceNames` must include it).
- Fix the `--webhook-validating-config-name` / `--webhook-cert-dir` flags if they
  drifted from the deployed VWC and mount.
- As a longer-term option, switch to `--webhook-cert-mode=cert-manager` so renewal
  is handled by cert-manager.


---

# Workqueue saturation

Source: https://stageset.projects.metio.wtf/runbooks/workqueue-saturation/


## Symptom

`workqueue_depth{controller="stageset"}` stays high; StageSets reconcile slowly or
lag behind their `spec.interval`. The `StageSetControllerWorkqueueDepthHigh` alert
fires (see [operations](/installation/operations/) for the alert set and its
thresholds).

## Cause

The controller is enqueuing reconcile requests faster than it completes them.
Common causes:

- **apiserver slowness** — applies, dry-runs, and status writes all block on the
  apiserver (or the impersonated tenant's authorization).
- **slow sources** — a stage waiting on a large artifact fetch or a source that is
  slow to become Ready holds a worker.
- **a stuck stage** — an action with a long timeout (a `wait`/`http`/`job` that
  never completes) pins a worker for the whole timeout.
- **too few workers for the StageSet count** — many StageSets reconciling on short
  intervals.

## Diagnosis

```shell
# which StageSets are churning?
kubectl get stagesets -A --sort-by=.status.observedGeneration
# controller logs for slow operations / retries
kubectl -n stageset-system logs deploy/stageset-controller --tail=200
```

Correlate with `controller_runtime_reconcile_time_seconds` (see
[reconcile latency](/runbooks/reconcile-latency/)) and apiserver latency.

## Remediation

- Lengthen `spec.interval` on high-churn StageSets that don't need fast
  reconciliation.
- Cap long-running actions with a tighter `timeout` so a stuck action releases its
  worker.
- Adding replicas does **not** help: leader election means only the lease holder
  reconciles, so a second replica is failover, not added throughput
  ([production](/installation/production/#high-availability)). The controller has no
  reconcile-concurrency flag, so the levers are reducing load (longer intervals, fewer
  StageSets, fewer objects per stage) and removing the slow operations listed under
  Cause.
- Investigate apiserver / tenant-authorization latency if reconciles are uniformly
  slow.


---

# Contributing

Source: https://stageset.projects.metio.wtf/contributing/


Contributions are welcome. These pages cover building and testing the controller,
the checks a pull request must pass, and how releases are cut.

## Developer Certificate of Origin

Every commit must be signed off under the
[Developer Certificate of Origin](https://developercertificate.org/) — it
certifies you wrote the patch or otherwise have the right to contribute it. Add
the sign-off automatically with:

```shell
git commit --signoff
```

This appends a `Signed-off-by: Your Name <you@example.com>` line to the commit
message. The DCO check on each pull request enforces it; a commit without the
sign-off line blocks the merge. Amend an existing commit with `git commit --amend --signoff`.


---

# Building and testing

Source: https://stageset.projects.metio.wtf/contributing/building/


The controller is a standard Go module. With a Go toolchain installed:

```shell
go build ./...
go test -race -cover ./...
```

## Test layers

- **Unit tests** sit next to the code across `internal/...` and `api/v1/`. Several
  are drift gates — e.g. `conditions_test.go` asserts every Ready `Reason` has a
  matching runbook page under `docs/content/runbooks/`.
- **envtest-backed tests** (`envtest_*_test.go`) boot a real kube-apiserver + etcd
  via controller-runtime's `envtest`. They `t.Skip` unless `KUBEBUILDER_ASSETS`
  points at an asset bundle — install it with
  [`setup-envtest`](https://book.kubebuilder.io/reference/envtest.html).
- **Fuzz tests** (`FuzzXxx`) harden the parsing-heavy paths; their seed corpus runs
  as ordinary unit tests, and `-fuzz` fuzzes for real.
- **Kind smoke** scenarios under `hack/smoke/` run the controller end to end
  against a real kind cluster.
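
The envtest and fuzz layers need a little setup before they do anything. The helpers below are a sketch, not project tooling: the function names are hypothetical, and using `@latest` with the default asset version is an assumption (the containerized dev shell pins its own assets).

```shell
# Download an envtest asset bundle, then run the suite with the
# envtest-backed tests enabled.
run_envtests() {
  assets=$(go run sigs.k8s.io/controller-runtime/tools/setup-envtest@latest use -p path) || return 1
  KUBEBUILDER_ASSETS="$assets" go test -race ./...
}

# Fuzz one target for a bounded time.
# $1: fuzz target name, $2: package path, $3: optional duration (default 30s)
run_fuzz() {
  # -run='^$' skips the ordinary tests; -fuzz only accepts a single package.
  go test -run='^$' -fuzz="$1" -fuzztime="${3:-30s}" "$2"
}
```

For example, `run_fuzz FuzzXxx ./internal/...` would be rejected by `go test` because the pattern matches more than one package, so pass the one package that defines the target.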

## Static analysis

A pull request must be clean under each of these — run them locally before
pushing:

```shell
go vet ./...
go run honnef.co/go/tools/cmd/staticcheck@latest ./...
go run github.com/securego/gosec/v2/cmd/gosec@latest ./...
go run golang.org/x/vuln/cmd/govulncheck@latest ./...
go run mvdan.cc/gofumpt@latest -l .          # empty output == formatted
go run github.com/fe3dback/arch-go@latest    # architecture rules (arch-go.yml)
```

## Containerized dev shell

The toolchain — including the pinned `setup-envtest` assets — is also packaged in
a container via [ilo](https://ilo.projects.metio.wtf/), so you can build and test
without installing anything on the host:

```shell
ilo bash -c 'go test -race -cover ./...'
```


---

# CI and releases

Source: https://stageset.projects.metio.wtf/contributing/ci-and-release/


## Continuous integration

Every pull request runs `verify.yml`, which fans out into one job per concern so a
failure points straight at the cause:

- **test** — `go build` then the full `go test` suite.
- **lint-go** — `go vet`, `staticcheck`, `gosec`, and a `gofumpt` formatting check.
- **vulnerabilities** — `govulncheck` (a reachable advisory is a hard gate).
- **architecture** — `arch-go` against `arch-go.yml`.
- **reuse** — SPDX/REUSE compliance on every file.
- **text linters** — `yamllint`, `actionlint`, `markdownlint`, `typos`.
- **container-image** — a buildx image build plus a Trivy scan.

A single **all-green** job depends on every other job and is the only required
check, so new jobs are covered automatically. A separate `kind-smoke.yml` runs the
operator end to end against a real kind cluster, and `fuzz.yml` exercises the fuzz
targets.

## Releases

Releases are **calendar-based and fully automated** — there is no semver tag to
bump by hand. `release.yml` runs on a Monday cron (and on manual dispatch), and the
version is the run date (`date +'%Y.%-m.%-d'`, e.g. `2026.6.15`). A prepare job
counts commits since the last release; an empty week publishes nothing.
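
The version string can be reproduced locally. This is a small sketch, assuming GNU `date` (for `-d` and the `%-m`/`%-d` no-padding flags) and UTC as the reference zone, which the pipeline description does not state. `release_version` is a hypothetical helper, not part of `release.yml`.

```shell
# Print the calendar version for a date (default: today), without zero padding.
release_version() {
  date -u -d "${1:-now}" +'%Y.%-m.%-d'
}

release_version 2026-06-15   # prints 2026.6.15
release_version 2026-01-05   # prints 2026.1.5
```

Because month and day are not zero-padded, compare versions component by component (for example with `sort -V`) rather than as plain strings.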

The pipeline is hand-rolled — no goreleaser, no GPG:

- Binaries are cross-compiled with `go build` (`CGO_ENABLED=0`, `-trimpath`,
  `-ldflags`) and archived per platform.
- A multi-arch image is pushed to `ghcr.io/metio/stageset-controller` and signed
  with **cosign keyless** (Fulcio OIDC) by digest.
- The GitHub Release attaches the archives, a `SHA256SUMS` file, and its cosign
  signature; identity is proven by the workflow's OIDC certificate, so there is no
  key to distribute.
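
Once the archives and `SHA256SUMS` are downloaded into one directory, their integrity can be checked locally. This is a minimal sketch, assuming `SHA256SUMS` uses the standard `sha256sum` format and that GNU coreutils is available; `verify_release_checksums` is a hypothetical helper. It only shows that the archives match the checksum file. The authenticity of `SHA256SUMS` itself rests on its cosign signature, which this sketch does not verify.

```shell
# Check downloaded release archives against SHA256SUMS.
# $1: directory holding SHA256SUMS and the archives (default: current directory)
verify_release_checksums() {
  # --ignore-missing skips entries for archives that were not downloaded,
  # such as builds for other platforms. The subshell keeps the cd local.
  ( cd "${1:-.}" && sha256sum --check --ignore-missing SHA256SUMS )
}
```

The function returns non-zero if any downloaded archive fails its checksum, so it can gate an install script.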

The Helm chart lives in the [metio/helm-charts](https://github.com/metio/helm-charts)
repository and vendors this repo's CRDs at each release; its `appVersion` tracks the
binary's releases.