# StageSet Controller — full documentation
> The complete StageSet Controller documentation (https://stageset.projects.metio.wtf/) concatenated for
> LLMs. For a concise link index see https://stageset.projects.metio.wtf/llms.txt.
# StageSet Controller
`stageset-controller` is a [Flux](https://fluxcd.io/) controller for ordered, gated, multi-stage delivery.
Flux's `kustomize-controller` and `helm-controller` apply an artifact in one
shot. That fits most releases, but not one that has to happen in sequence:
install the CRDs before the operator that needs them, run a database migration
before the app that reads the new schema, hold a production rollout until the
canary is healthy, freeze changes during business hours.
A `StageSet` describes a release as an ordered list of stages. Each stage applies
a Flux source — a `GitRepository`, `OCIRepository`, `Bucket`, or an
[`ExternalArtifact`](https://fluxcd.io/flux/components/source/externalartifacts/)
(including one rendered on the fly by a producer like [JaaS](https://jaas.projects.metio.wtf/)) —
waits for it to become healthy, and only then lets the next stage begin. Between
stages, you can run typed actions (a migration `Job`, an HTTP gate, a wait-for-condition),
gate rollouts behind [update windows](/usage/update-windows/), and run
version-aware [migrations](/usage/versioned-migrations/) when you cross a release boundary.
Everything is reconciled continuously, drift-corrected, and pruned with ApplySet
semantics.
## What a StageSet looks like
The smallest useful StageSet is one stage pointing at one artifact:
```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: my-app
  namespace: default
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app   # an ExternalArtifact in this namespace
```
The same shape scales up to a gated rollout:
```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: payments
  namespace: payments
spec:
  serviceAccountName: payments-deployer   # every apply is impersonated as this SA
  stages:
    # 1 ── shared infrastructure: CRDs, namespaces, RBAC
    - name: infrastructure
      sourceRef:
        name: payments-infra              # an ExternalArtifact
      readyChecks:
        checks:
          - apiVersion: apiextensions.k8s.io/v1
            kind: CustomResourceDefinition
            name: ledgers.payments.example
    # 2 ── the application, started only once infrastructure is Ready
    - name: application
      sourceRef:
        name: payments-app
      actions:
        pre:
          - name: db-migrate              # runs before the manifests are applied
            job:
              sourceRef:
                name: payments-migrations
        post:
          - name: smoke-test              # stage is Ready only if this passes
            http:
              url: https://payments.internal/healthz
              expectedStatus: [200]
  # new revisions roll out only outside the Friday-evening change freeze
  updateWindows:
    - type: Deny
      schedule: "0 17 * * FRI"
      duration: 60h
      timeZone: Europe/Berlin
```
Stages run top to bottom. `infrastructure` must report Ready (its CRD established)
before `application` is touched; the migration Job runs before the app is applied;
the rollout is held when the change-freeze window is open. Everything is
continuously reconciled — drift is corrected, removed objects are pruned.
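
As a rough check of the freeze window in the example above, the sketch below works out when a window that opens on Friday at 17:00 and lasts 60 hours closes. It is plain shell arithmetic, not controller code; `window_end` is a hypothetical helper, and the calculation ignores daylight-saving shifts in `Europe/Berlin`, whose handling by the controller is not described here.

```shell
# Hypothetical helper: given a start hour (0-23) and a duration in hours,
# print how many whole days later, and at which hour, the window closes.
window_end() {
  start_hour=$1
  duration_hours=$2
  total=$(( start_hour + duration_hours ))
  printf '+%dd %02d:00\n' "$(( total / 24 ))" "$(( total % 24 ))"
}

# Friday 17:00 plus 60 hours
window_end 17 60
```

Friday plus three days is Monday, so under these assumptions the freeze runs from Friday 17:00 to Monday 05:00 local time.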
## Where to go next
- **[Installation](/installation/)** — install on Kubernetes, then harden for
production and wire up observability.
- **[Usage](/usage/)** — worked examples for every feature, from a single stage
to versioned migrations.
- **[CLI](/cli/)** — `stagesetctl` for previewing (`diff`), rendering (`build`),
and driving (`reconcile`) StageSets.
- **[API reference](/api/)** — every field of every custom resource, explained.
- **[Comparisons](/comparisons/)** — how StageSet relates to Helm, Kustomize,
Tanka, kubecfg, and plain Flux.
- **[Runbooks](/runbooks/)** — symptom → cause → remediation for every status
reason.
## Related projects
`stageset-controller` handles the delivery end and composes with two adjacent
projects, each useful on its own:
- **[JOI](https://github.com/metio/jsonnet-oci-images)** publishes Jsonnet
libraries as single-layer OCI images.
- **[JaaS](https://jaas.projects.metio.wtf/)** evaluates Jsonnet on demand and
publishes the result as a Flux `ExternalArtifact`.
- `stageset-controller` takes those artifacts and rolls them out, in order, with
gates.
JOI and JaaS are not required — a stage reads straight from a `GitRepository`,
`OCIRepository`, or `Bucket`, or from any `ExternalArtifact`, whatever produced
it.
---
# Installation
Source: https://stageset.projects.metio.wtf/installation/
Get `stageset-controller` running on a [Kubernetes](https://kubernetes.io/docs/)
cluster, then keep it healthy in [production](/installation/production/).
---
# Configuration reference
Source: https://stageset.projects.metio.wtf/installation/configuration/
The controller is configured entirely through command-line flags, grouped below
by subsystem. When deployed via the Helm chart you never pass these directly — the
chart sets them from your values and its own defaults; each section notes the Helm
value that drives a flag, and the
[metio/helm-charts](https://github.com/metio/helm-charts/tree/main/charts/stageset-controller)
repo carries the full values reference. For the Helm values worth tuning and the
reasoning behind each, see [Production](/installation/production/#settings-you-can-tune);
for metrics and runbooks, [Operations](/installation/operations/).
## Manager and leader election
| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--health-probe-bind-address` | `:8081` | Address the liveness and readiness probe endpoints bind to. | _chart-managed_ |
| `--leader-elect` | `false` | Enable controller-runtime leader election so only one replica reconciles at a time. Recommended for HA deployments. | `controller.leaderElect` |
The leader-election lease name is fixed at `stageset-controller.stages.metio.wtf`
and is created in the namespace the controller pod runs in.
## Watch scope
| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--watch-namespaces` | _(empty)_ | Comma-separated list of namespaces the controller watches. Empty (the default) means cluster-wide. When set, the manager's cache only observes StageSets and sources in these namespaces — the multi-tenant controller-instances pattern. Falls back to the `STAGESET_WATCH_NAMESPACES` environment variable when the flag is empty. | `controller.watchNamespaces` |
**Environment variable:** `STAGESET_WATCH_NAMESPACES` — comma-separated
namespace list. When `--watch-namespaces` is non-empty the flag takes
precedence. When restricted, the chart pivots RBAC to per-namespace
RoleBindings instead of a cluster-wide ClusterRoleBinding.
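
The precedence between the flag and the environment variable can be sketched as a small shell function. This only mirrors the rule stated above and is not the controller's actual code; `effective_watch_namespaces` is a hypothetical name.

```shell
# Sketch of the documented precedence.
# $1 is the value passed to --watch-namespaces (may be empty).
effective_watch_namespaces() {
  flag_value=$1
  if [ -n "$flag_value" ]; then
    # A non-empty flag wins over the environment variable.
    printf '%s\n' "$flag_value"
  else
    # Fall back to the environment; empty output means cluster-wide.
    printf '%s\n' "${STAGESET_WATCH_NAMESPACES:-}"
  fi
}
```

With `STAGESET_WATCH_NAMESPACES=team-a,team-b` set, `effective_watch_namespaces ""` prints `team-a,team-b`, while `effective_watch_namespaces team-c` prints `team-c`.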
## Reconciliation defaults
| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--default-interval` | `10m` | Reconcile cadence for StageSets that omit `spec.interval`. | `controller.defaultInterval` |
| `--inventory-mode` | `hybrid` | Inventory strategy for tracking applied resources: `entries`, `hybrid`, or `applyset`. | `controller.inventoryMode` |
| `--inventory-shard-cap` | `5000` | Maximum number of resource entries per `StageInventory` shard. | `controller.inventoryShardCap` |
| `--no-cross-namespace-refs` | `false` | Deny `sourceRef` and `dependsOn` references that target a different namespace. | `controller.noCrossNamespaceRefs` |
| `--allowed-action-hosts` | _(empty)_ | Host glob allowed for `http` actions; repeatable. Loopback and link-local ranges are always denied unless explicitly listed. | `controller.allowedActionHosts` |
| `--runbook-base-url` | _(empty)_ | Base URL for runbook links. When set, the controller appends `(runbook: <base-url>/<reason>/)` to actionable Ready condition messages. Empty disables. | `controller.runbookBaseURL` |
## Rollback store — filesystem
The rollback store preserves a copy of each stage's last-applied artifact so
that a rollback can re-apply the previous revision without re-fetching from the
producer. The filesystem backend is appropriate for single-replica deployments or
multi-replica deployments backed by an `RWX` volume.
`--rollback-store-path` and `--rollback-store-s3-endpoint` are mutually
exclusive. Both empty disables the store; rollback falls back to re-fetching the
producer artifact.
| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--rollback-store-path` | _(empty)_ | Filesystem directory (e.g. an RWX PVC mount) for the rollback store. Empty disables the filesystem backend. | `rollbackStore.backend: pvc` |
The file store writes rendered output — including Secret data — in the clear.
The volume must provide encryption at rest (encrypted StorageClass, LUKS, or
cloud-disk encryption).
## Rollback store — S3
Active when `--rollback-store-s3-endpoint` and `--rollback-store-s3-bucket` are
both non-empty.
| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--rollback-store-s3-endpoint` | _(empty)_ | S3-compatible endpoint (`host:port`, e.g. `s3.amazonaws.com` or `minio.minio.svc:9000`). Empty disables the S3 backend. | `rollbackStore.s3.endpoint` |
| `--rollback-store-s3-bucket` | _(empty)_ | S3 bucket for the rollback store. Must already exist. | `rollbackStore.s3.bucket` |
| `--rollback-store-s3-prefix` | _(empty)_ | Optional object-key prefix so the rollback store can coexist with other tenants in one bucket. | `rollbackStore.s3.prefix` |
| `--rollback-store-s3-region` | _(empty)_ | S3 region. Required for AWS multi-region buckets; ignored by most S3-compatible servers. | `rollbackStore.s3.region` |
| `--rollback-store-s3-use-ssl` | `true` | Use HTTPS to talk to the S3 endpoint. Set to `false` only for local MinIO over plain HTTP. | `rollbackStore.s3.useSSL` |
| `--rollback-store-s3-access-key` | _(empty)_ | Static access key. Empty engages minio-go's IAM/IRSA credential discovery chain (env → web-identity → EC2/EKS metadata). | `rollbackStore.s3.existingSecret` |
| `--rollback-store-s3-secret-key` | _(empty)_ | Secret key, paired with `--rollback-store-s3-access-key`. | `rollbackStore.s3.existingSecret` |
| `--rollback-store-s3-session-token` | _(empty)_ | Optional session token for temporary credentials (e.g. IRSA). | `rollbackStore.s3.existingSecret` |
| `--rollback-store-s3-anonymous` | `false` | Skip request signing. For public buckets only. | `rollbackStore.s3.anonymous` |
| `--rollback-store-s3-sse` | `s3` | Server-side encryption for stored objects: `none`, `s3` (SSE-S3), or `kms` (SSE-KMS). The store holds rendered Secret data, so encryption is on by default. Set `none` only for a bucket whose backend cannot honor an SSE header. | `rollbackStore.s3.sse` |
| `--rollback-store-s3-sse-kms-key` | _(empty)_ | KMS key ARN or ID for `--rollback-store-s3-sse=kms`. Empty uses the bucket's default KMS key. | `rollbackStore.s3.sseKmsKeyId` |
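
The backend selection rules from the two rollback-store sections can be summarised in a short sketch. The function name and the printed labels are illustrative, not controller output. Treating an endpoint without a bucket as "no store" is an assumption, since the text only says the S3 backend is active when both are set.

```shell
# Sketch of the selection rules described above.
# $1 = --rollback-store-path
# $2 = --rollback-store-s3-endpoint
# $3 = --rollback-store-s3-bucket
rollback_backend() {
  path=$1
  endpoint=$2
  bucket=$3
  if [ -n "$path" ] && [ -n "$endpoint" ]; then
    echo "error: path and s3-endpoint are mutually exclusive" >&2
    return 1
  elif [ -n "$path" ]; then
    echo filesystem
  elif [ -n "$endpoint" ] && [ -n "$bucket" ]; then
    echo s3
  else
    # No usable store: rollback re-fetches the producer artifact.
    echo none
  fi
}
```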
## Metrics and health
| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--metrics-bind-address` | `:8080` | Address the controller-runtime Prometheus metrics endpoint binds to. `"0"` disables. | _chart-managed_ |
The metrics endpoint exposes standard `controller_runtime_*` and `workqueue_*`
series alongside the custom `stageset_*` metrics documented in
[Operations](/installation/operations/).
## Webhook and TLS provisioning
The validating admission webhook for `StageSet` is enabled by default. Two TLS
provisioning modes are supported.
| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--enable-webhook` | `true` | Enable the validating admission webhook for `StageSet`. | _chart-managed_ |
| `--webhook-cert-mode` | `cert-manager` | TLS provisioning mode: `cert-manager` (chart renders a `Certificate` CR; cert is mounted from a Secret) or `self-signed` (the controller generates a CA and serving cert in-pod and patches the `ValidatingWebhookConfiguration` `caBundle`). | `webhook.certMode` |
| `--webhook-cert-dir` | `/tmp/k8s-webhook-server/serving-certs` | Directory holding `tls.crt` and `tls.key` for the webhook server. | _chart-managed_ |
| `--webhook-port` | `9443` | Port the validating webhook server binds to. | _chart-managed_ |
| `--webhook-cert-validity` | `8760h` (1 year) | Validity of the self-signed serving cert. The controller rotates it every `validity/3`. | `webhook.*` |
| `--webhook-service-name` | `stageset-controller-webhook` | Kubernetes Service the webhook is reachable through. Used to build cert SANs in `self-signed` mode. | _chart-managed_ |
| `--webhook-service-namespace` | _(empty)_ | Namespace of the webhook Service. Empty falls back to the in-cluster ServiceAccount namespace. | _chart-managed_ |
| `--webhook-validating-config-name` | _(empty)_ | Name of the `ValidatingWebhookConfiguration` whose `caBundle` the controller patches. Required when `--webhook-cert-mode=self-signed`. | _chart-managed_ |
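
As a worked example of the `validity/3` rotation rule for `--webhook-cert-validity`, the sketch below computes the interval for the default validity. `rotation_interval_hours` is a hypothetical helper that takes whole hours only.

```shell
# Sketch: the rotation interval implied by "rotates every validity/3".
rotation_interval_hours() {
  # $1 = certificate validity in whole hours
  echo $(( $1 / 3 ))
}

rotation_interval_hours 8760   # the documented default validity; prints 2920
```

2920 hours is a little under 122 days, so with the default validity the serving certificate would be renewed roughly three times a year.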
## Gate endpoint
The gate endpoint exposes a read-only HTTP API for Flagger canary stage-gates.
`GET /gate/{namespace}/{stageset}/{stage}` returns `200` when the named stage is
ready to advance and `503` otherwise.
| Flag | Default | Description | Helm value |
|---|---|---|---|
| `--gate-bind-address` | `:8082` | Address for the Flagger stage-gate endpoint. Empty disables the endpoint. | `gate.enabled` |
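
A caller can turn the `200`/`503` contract into a shell exit status. The sketch below assumes the endpoint is served over plain HTTP and is reachable at `GATE_HOST`, which is a placeholder (for example after a port-forward to the gate port); the function names are hypothetical.

```shell
# Placeholder address: point this at wherever the gate endpoint is reachable.
GATE_HOST="${GATE_HOST:-localhost:8082}"

gate_url() {
  # $1 = namespace, $2 = StageSet name, $3 = stage name
  printf 'http://%s/gate/%s/%s/%s\n' "$GATE_HOST" "$1" "$2" "$3"
}

stage_ready() {
  # Succeeds on HTTP 200 and fails on anything else, including 503.
  code=$(curl -s -o /dev/null -w '%{http_code}' "$(gate_url "$1" "$2" "$3")")
  [ "$code" = "200" ]
}
```

For example, `stage_ready payments payments application && echo "go"` prints `go` only when the gate reports the stage ready to advance.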
## Logging
Logging is powered by the controller-runtime `zap` logger. The standard zap
flags (`--zap-log-level`, `--zap-encoder`, `--zap-stacktrace-level`,
`--zap-time-encoding`, and `--zap-devel`) are available and bound to
`flag.CommandLine`; run `stageset-controller --help` to see their current
defaults.
---
# Install on Kubernetes
Source: https://stageset.projects.metio.wtf/installation/kubernetes/
## Prerequisites
- A [Kubernetes](https://kubernetes.io/docs/) cluster with `kubectl` and
[`helm`](https://helm.sh/) configured against it.
- [Flux](https://fluxcd.io/) `source-controller`, specifically the
`ExternalArtifact` API (`source.toolkit.fluxcd.io`). A `StageSet` stage always
resolves to an `ExternalArtifact`, so the CRD must exist. `ExternalArtifact`
lands in Flux **v2.7.0**; install at least that version. The controller also
watches `GitRepository`, `OCIRepository`, and `Bucket` sources for
producer-aware resolution.
- [cert-manager](https://cert-manager.io/), only if you choose the
`cert-manager` webhook certificate mode. The chart defaults to `self-signed`,
which provisions and rotates the admission webhook's TLS in-process and needs
no cert-manager. See [production](/installation/production/#admission-webhook-tls)
for the trade-off.
Installing the controller does not require [JaaS](https://jaas.projects.metio.wtf/),
JOI, or any other particular artifact producer; those are sources of
`ExternalArtifact`s, wired up per `StageSet`.
## Install with Helm
The controller is distributed as an OCI [Helm](https://helm.sh/) chart. The
deployment manifests live in the chart, not in the controller repository.
```shell
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace
```
The container image is `ghcr.io/metio/stageset-controller`; the chart pins the
tag to its own `appVersion` by default.
Every setting referenced across these docs — HA replicas, the rollback store,
webhook mode, NetworkPolicy, the ServiceMonitor, and the rest — is a Helm value.
The [chart's README and `values.yaml`](https://github.com/metio/helm-charts/tree/main/charts/stageset-controller)
document the full, current list.
### What the chart installs
- The **controller `Deployment`**, its `ServiceAccount`, and the cluster RBAC it
needs (a `ClusterRole` + `ClusterRoleBinding`, plus a namespaced leader-election
`Role`/`RoleBinding`).
- The **CRDs** — `StageSet` and `StageInventory`.
- The **validating admission webhook** (`ValidatingWebhookConfiguration` + a
webhook `Service`).
- A **metrics `Service`** (and an opt-in `ServiceMonitor`).
- The **Flagger gate `Service`** for the read-only stage-gate endpoint.
- Opt-in extras: `NetworkPolicy`, `PodDisruptionBudget`,
`HorizontalPodAutoscaler`, a rollback-store `PersistentVolumeClaim`, and a
managed `Namespace`.
### About the CRDs
The CRDs ship inside the chart's regular templates (not Helm's special `crds/`
directory), so a `helm upgrade` applies schema changes like any other resource.
This is governed by `crds.create` (default `true`). The CRDs carry
`helm.sh/resource-policy: keep`, so a `helm uninstall` leaves them — and your
StageSets — in place; remove them by hand if you really mean to.
If you manage CRDs out of band, the raw definitions are also published in the
controller repository under `config/crd/` and can be applied with
`kubectl apply --server-side -f`.
## Verify
```shell
kubectl -n stageset-system get deploy stageset-controller
kubectl get crd stagesets.stages.metio.wtf stageinventories.stages.metio.wtf
```
Once the controller is `Available`, create your first
[StageSet](/usage/stages-and-sources/).
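
To wait for the `Available` condition from a script rather than polling by hand, a sketch along these lines may help. It wraps the standard `kubectl wait` command; the function name and the timeout are illustrative, and the Deployment name is the one used in the verify commands above.

```shell
# Hypothetical helper: block until the Deployment reports Available, or time out.
# $1 = namespace, $2 = timeout (for example 120s)
wait_for_controller() {
  kubectl -n "$1" wait --for=condition=Available \
    deployment/stageset-controller --timeout="$2"
}
```

Calling `wait_for_controller stageset-system 120s` returns non-zero if the condition is not met within the timeout.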
---
# Operations
Source: https://stageset.projects.metio.wtf/installation/operations/
## Metrics
The controller registers custom metrics on the controller-runtime registry, served
on `--metrics-bind-address` (`:8080`) alongside the standard
`controller_runtime_*` and `workqueue_*` series. Enable scraping with the chart's
opt-in `ServiceMonitor` (`metrics.serviceMonitor.enabled`):
```yaml
# values.yaml
metrics:
  serviceMonitor:
    enabled: true   # needs the Prometheus operator CRDs
```
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| `stageset_reconcile_total` | counter | `namespace`, `name`, `reason` | Reconciles, by terminal Ready reason. |
| `stageset_stage_applied_total` | counter | `namespace`, `name`, `stage` | Stages applied and verified. |
| `stageset_drift_corrected_total` | counter | `namespace`, `name`, `stage` | Out-of-band drift re-asserted on a steady-state reconcile. |
| `stageset_update_deferred_total` | counter | `namespace`, `name` | Rollouts held by a closed update window. |
| `stageset_webhook_cert_renewal_failures_total` | counter | _(none)_ | Failed self-signed webhook cert renewals. |
| `stageset_stage_ready` | gauge | `namespace`, `stageset`, `stage` | `1` when a stage is Ready, else `0` — for metric-based [progressive delivery](/tutorials/progressive-delivery/#argo-rollouts). |
## Alerts
The chart ships an opt-in `PrometheusRule` with a starter alert set, gated on
`metrics.prometheusRule.enabled` (requires the
[Prometheus operator](https://prometheus-operator.dev/) CRDs). It covers the
custom `stageset_*` metrics plus controller-runtime signals:
| Alert | Fires on | Severity |
|---|---|---|
| `StageSetReconcileErrorsHigh` | per-StageSet Ready=False rate (excludes the healthy `Succeeded`/`Suspended` reasons) | warning |
| `StageSetControllerWorkqueueDepthHigh` | the reconcile queue not draining | warning |
| `StageSetReconcileLatencyHigh` | reconcile p99 latency over threshold | warning |
| `StageSetControllerPodDown` | a controller pod NotReady | critical |
| `StageSetWebhookCertRenewalFailing` | self-signed cert rotation failing | critical |
Every threshold is a knob under `metrics.prometheusRule.thresholds`, and
`extraAlertLabels` is merged onto every rendered alert so all stageset alerts can
route through one Alertmanager receiver. Each alert carries a `runbook_url`
annotation pointing at the matching [runbook](/runbooks/) page on this site
(`metrics.prometheusRule.runbookBaseURL`); the reconcile-errors alert templates the
URL on `$labels.reason`. Append your own rules under
`metrics.prometheusRule.extraRules`, and silence a built-in alert by raising its
threshold rather than forking the chart.
## Events
The controller emits Kubernetes Events on every Ready-condition transition, so
`kubectl describe stageset <name>` and [Flux](https://fluxcd.io/)'s
`notification-controller` (via an `Alert` targeting `kind: StageSet`) both
surface what happened. Normal events
include `Succeeded`, `UpdateDeferred`, `MigrationStarted`, and
`MigrationCompleted`; warnings include `StageFailed`, `DriftCorrected`,
`RolledBack`, `MigrationFailed`, `OnFailureAction`, and `RollbackStoreFailed`.
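
To list only the warning events for one StageSet, a field selector on the standard Events API can narrow the output. This is a minimal sketch, assuming the events are recorded against the StageSet object itself; `stageset_warnings` is a hypothetical name.

```shell
# Hypothetical helper: show Warning events recorded for one StageSet.
# $1 = namespace, $2 = StageSet name
stageset_warnings() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.kind=StageSet,involvedObject.name=$2,type=Warning"
}
```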
## Runbooks
Every actionable Ready-condition reason has a [runbook](/runbooks/) covering the
symptom, cause, diagnosis, and remediation. Set `--runbook-base-url` (the chart's
`controller.runbookBaseURL`, which defaults to this docs site) to a published copy
of those pages and the controller appends `(runbook: <base-url>/<reason>/)` to the
Ready message (the reason lower-cased into a path segment), so a `kubectl describe`
links straight to the fix. Healthy reasons (`Succeeded`, `Suspended`) get no link.
```yaml
# values.yaml — point at your own mirror, or set "" to drop the links
controller:
  runbookBaseURL: https://runbooks.internal/stageset
```
For example, a `StageFailed` StageSet then shows:
```text
Message: stage "application" failed: … (runbook: https://runbooks.internal/stageset/stagefailed/)
```
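
The link format in the example message can be reproduced with a few lines of shell, which is handy when checking that a mirror serves the expected paths. This is a sketch of the format as described, not the controller's implementation; `runbook_url` is a hypothetical name.

```shell
# Sketch: join the base URL and the lower-cased reason into a runbook link.
runbook_url() {
  [ -n "$1" ] || return 0   # an empty base URL disables the link
  base=${1%/}               # drop one trailing slash so the join does not double it
  reason=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  printf '%s/%s/\n' "$base" "$reason"
}

runbook_url https://runbooks.internal/stageset StageFailed
```

The call prints `https://runbooks.internal/stageset/stagefailed/`, matching the message shown above.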
## Forcing a reconcile
The controller reconciles on its `spec.interval`, on source changes, and on
demand. To trigger an out-of-band run, stamp the standard annotation — which is
what `flux reconcile` and [`stagesetctl reconcile`](/cli/reconcile/) do for you:
```shell
kubectl annotate stageset my-app \
  reconcile.fluxcd.io/requestedAt="$(date -u +%FT%TZ)" --overwrite
```
The handled token is recorded in `status.lastHandledReconcileAt`.
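
To confirm that a requested run was picked up, the stamped value can be compared with the status field. A minimal sketch, assuming the token is stored verbatim; `reconcile_handled` is a hypothetical helper.

```shell
# Hypothetical helper: succeeds once the controller has handled the given token.
# $1 = StageSet name, $2 = namespace, $3 = the requestedAt value that was stamped
reconcile_handled() {
  handled=$(kubectl get stageset "$1" -n "$2" \
    -o jsonpath='{.status.lastHandledReconcileAt}')
  [ "$handled" = "$3" ]
}
```

Capture the timestamp in a variable before annotating, then poll `reconcile_handled my-app default "$token"` until it succeeds.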
## Drift correction
On a steady-state reconcile the controller re-asserts the desired state, healing
out-of-band changes to managed objects. Each correction emits a `DriftCorrected`
event and increments `stageset_drift_corrected_total`. Tighten the cadence with
`spec.driftDetectionInterval` when you need faster healing than `spec.interval`.
---
# Production
Source: https://stageset.projects.metio.wtf/installation/production/
## High availability
The controller supports leader-elected HA. Enable leader election and run more
than one replica; only the lease holder reconciles, while every replica answers
admission webhook calls (admission must stay available even on non-leaders).
- Leader election is toggled with `--leader-elect`. The binary defaults it to
`false`, but the **Helm chart enables it by default** (`controller.leaderElect:
true`), so a default install is already lease-guarded even at one replica.
- The lease is named `stageset-controller.stages.metio.wtf` and lives in the
controller's namespace. It uses controller-runtime's default timing (~15 s
lease duration). The lease is **not** released eagerly on shutdown, so after a
rolling update the new leader takes over when the old lease expires — budget a
few seconds of reconcile pause on restart (admission and the gate endpoint are
unaffected).
- Scaling: when the chart's `replicas.max` exceeds `replicas.min` it renders a
`HorizontalPodAutoscaler` (CPU target 80%) and a `PodDisruptionBudget`
(`minAvailable: 1`). At the default 1/1 it sets neither and leaves
`spec.replicas` unmanaged.
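
To see which replica currently holds the lease named in the list above, the standard `Lease` object can be read directly. A small sketch; `lease_holder` is a hypothetical name, and the namespace argument is whatever namespace the controller runs in.

```shell
# Hypothetical helper: print the identity of the replica holding the lease.
# $1 = controller namespace (stageset-system in the install example)
lease_holder() {
  kubectl -n "$1" get lease stageset-controller.stages.metio.wtf \
    -o jsonpath='{.spec.holderIdentity}'
}
```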
The controller watches every namespace by default. Multi-tenancy is enforced per
`StageSet` through impersonation (see below). You can additionally scope the
controller to a namespace set with `controller.watchNamespaces` — one controller
instance per tenant-group — and run it under `cluster-admin` for single-tenant
clusters; both are covered in
[multi-cluster and tenancy](/usage/multi-cluster/).
## Hardening
Each option below is shown as the Helm values that configure it. Several are
already the chart's defaults, shown so you can see what is applied and override
it for a stricter policy.
### Tenant impersonation
The controller never applies your manifests with its own identity. Every cluster
operation for a `StageSet` — building, applying, pruning, running actions — is
performed impersonating the `StageSet`'s `spec.serviceAccountName` (the chart
grants the controller `impersonate`, not write access). A `StageSet` can only do
what its tenant SA permits; an over-broad or missing SA fails closed.
This one lives on the `StageSet`, not in the chart — give every production
`StageSet` a scoped `ServiceAccount`:
```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata: { name: payments, namespace: payments }
spec:
  serviceAccountName: payments-deployer   # scoped to exactly this release's needs
  # …
```
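
Before relying on a tenant ServiceAccount, it can help to ask the API server what that account is allowed to do. The sketch below uses `kubectl auth can-i` with impersonation; the function names are hypothetical, and whoever runs the check needs permission to impersonate the ServiceAccount.

```shell
# Build the username Kubernetes uses for a ServiceAccount.
sa_user() {
  # $1 = namespace, $2 = ServiceAccount name
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Ask the API server whether that ServiceAccount may perform an action.
# $1 = namespace, $2 = ServiceAccount, $3 = verb, $4 = resource
sa_can() {
  kubectl auth can-i "$3" "$4" -n "$1" --as "$(sa_user "$1" "$2")"
}
```

For example, `sa_can payments payments-deployer create deployments` prints `yes` or `no` for the `payments-deployer` account from the example above.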
### Pod security context
The chart runs a non-root, read-only-root-filesystem pod with all capabilities
dropped, on a `gcr.io/distroless/static:nonroot` image (no shell or package
manager). These are the rendered defaults:
```yaml
podSecurityContext:
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  runAsGroup: 65532
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: [ALL]
  seccompProfile:
    type: RuntimeDefault
```
### Resource limits
Requests equal limits, so the pod is fully constrained:
```yaml
resources:
  cpu: 50m
  memory: 256Mi
  ephemeralStorage: 32Mi   # /tmp and the self-signed cert dir are emptyDirs
```
### Pod-Security Standards namespace
Have the chart create the install namespace with restricted PSS labels:
```yaml
namespace:
  create: true
  pssLevel: restricted   # or: baseline / privileged
```
### Network policy
The gate endpoint is **unauthenticated** (read-only
`GET /gate/{namespace}/{stageset}/{stage}`). Turn on the ingress-only NetworkPolicy
to fence it — and the webhook/metrics ports — to only the peers that need them:
```yaml
networkPolicy:
  enabled: true   # admits the webhook (9443), metrics (8080), gate (8082)
```
The policy is **ingress-only**, so it does not restrict egress — the controller can
still fetch stage artifacts over HTTP from source-controller (an `ExternalArtifact`
or a `GitRepository`/`OCIRepository`/`Bucket` is served from the same artifact
endpoint). If your cluster default-denies egress, add an egress allowance to
source-controller (and DNS) so those fetches succeed.
### Admission webhook TLS
`webhook.certMode` chooses how the webhook serving certificate is obtained:
```yaml
webhook:
  certMode: cert-manager    # cert-manager issues + rotates the cert (requires cert-manager)
  # certMode: self-signed   # chart default: in-pod CA + serving cert, rotated at
  #                         # validity/3, with no cert-manager dependency
```
## Reference setups
Two HA shapes — on-prem with shared RWX storage, and AWS/EKS with S3 — over the
same backbone: a leader-elected pair (or trio), a rollback store reachable from
whichever pod holds the lease, cert-manager for the webhook, a `NetworkPolicy`
fencing the unauthenticated gate, and a `ServiceMonitor` if you run Prometheus.
Both run two replicas for [HA](#high-availability) (`replicas.max` above
`replicas.min` also renders a PDB and an HPA) and set
`webhook.certMode: cert-manager`, so [cert-manager](https://cert-manager.io/) must
be installed in the cluster.
### On-prem (RWX storage)
The rollback store gives bit-exact rollbacks that outlive producer GC. With HA
replicas it must be reachable from whichever pod holds the lease, so use a
`ReadWriteMany` PVC on your on-prem storage class — every replica mounts the same
volume.
```yaml
# values-onprem.yaml
replicas:
  min: 2   # leader-elected HA; the non-leader still serves admission
  max: 3   # > min renders an HPA (CPU 80%) and a PodDisruptionBudget
controller:
  leaderElect: true
rollbackStore:
  backend: pvc
  pvc:
    accessModes: [ReadWriteMany]
    storageClass: nfs-client   # your RWX class (NFS, CephFS, …)
    size: 10Gi
webhook:
  certMode: cert-manager   # requires cert-manager in the cluster
networkPolicy:
  enabled: true   # fences the unauthenticated gate endpoint
metrics:
  serviceMonitor:
    enabled: true
```
```shell
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace \
  -f values-onprem.yaml
```
### AWS / EKS (S3)
On EKS, back the rollback store with S3 and let the controller assume an IAM role
through [IRSA](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html)
— no static keys. Annotate the controller's ServiceAccount with the role ARN and
leave the S3 credentials empty; the store's minio-go client picks the role up from
the pod's web-identity token.
```yaml
# values-eks.yaml
replicas:
  min: 2
  max: 3
controller:
  leaderElect: true
serviceAccount:
  annotations:
    # an IAM role granting s3:GetObject/PutObject/ListBucket/DeleteObject on the bucket
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/stageset-controller
rollbackStore:
  backend: s3
  s3:
    endpoint: s3.eu-west-1.amazonaws.com
    bucket: my-org-stageset-rollback
    region: eu-west-1
    # no existingSecret → credentials come from the IRSA role above
webhook:
  certMode: cert-manager
networkPolicy:
  enabled: true
metrics:
  serviceMonitor:
    enabled: true
```
```shell
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace \
  -f values-eks.yaml
```
### Alongside the other Flux controllers
`stageset-controller` is a [Flux](https://fluxcd.io/) citizen and needs no special
wiring to coexist with `source-controller`, `kustomize-controller`,
`helm-controller`, and `notification-controller`. It reads `ExternalArtifact` (and
the standard `GitRepository`, `OCIRepository`, and `Bucket` sources) from
`source-controller`, and `notification-controller` routes its events through an
`Alert` that targets `kind: StageSet` — no Provider/Alert plumbing of its own.
Install it in its own namespace (e.g. `stageset-system`) next to `flux-system`;
the only cluster-scoped pieces are its CRDs, `ClusterRole`, and webhook
configuration.
### Alongside JaaS
[JaaS](https://jaas.projects.metio.wtf/) renders Jsonnet and publishes the result
as an `ExternalArtifact`, which is what a `StageSet` stage consumes — so the two
compose directly. Reference the artifact by name, or name the producing
`JsonnetSnippet` and let `stageset-controller` resolve it (see
[producer-aware sources](/usage/producer-aware-sources/)). They can share a
cluster and namespace or stay separate; both are reconciled by Flux and both apply
under per-tenant impersonation, so the security model is consistent end to end.
## Settings you can tune
The chart wires the controller; you set Helm values. The set worth thinking about
is below — each row is the value, its default, and when you'd change it.
Everything else the chart configures for you (see
[what the chart manages](#what-the-chart-manages)).
| Helm value | Default | When to change |
|---|---|---|
| `replicas.min` / `replicas.max` | `1` / `1` | Raise both to ≥ 2 for HA; set `max > min` to also render an HPA + PDB. |
| `controller.leaderElect` | `true` | Leave on — harmless at one replica, required for HA. |
| `controller.defaultInterval` | `10m` | The reconcile cadence StageSets inherit when they omit `spec.interval`. Lower for faster drift correction cluster-wide. |
| `controller.inventoryMode` | `hybrid` | `applyset` for ApplySet-native tooling; `entries` to drop the ApplySet labels. |
| `controller.inventoryShardCap` | `5000` | Lower only if a stage applies a huge object count and you want smaller inventory objects. |
| `controller.allowedActionHosts` | `[]` | Add host globs your `http` [actions](/usage/actions/) must reach (loopback/link-local are always denied). |
| `controller.noCrossNamespaceRefs` | `false` | `true` to hard-isolate namespaces (deny cross-namespace `sourceRef`/`dependsOn`). |
| `controller.watchNamespaces` | `[]` | Restrict the controller to a namespace list (cache + RBAC pivot to per-namespace bindings); empty watches cluster-wide. See [tenancy](/usage/multi-cluster/#scoping-the-controller-to-a-namespace-set). |
| `rbac.clusterAdmin` | `false` | `true` on **single-tenant** clusters to bind the controller SA to `cluster-admin` so StageSets apply without `serviceAccountName`. See [single-tenant](/usage/multi-cluster/#single-tenant-cluster-admin). |
| `controller.runbookBaseURL` | the docs site | Point at a fork/mirror, or empty to drop the runbook links from Ready messages. |
| `webhook.certMode` | `self-signed` | `cert-manager` if you run cert-manager — see [reference setups](#reference-setups). |
| `gate.enabled` | `true` | Leave on for [progressive delivery](/tutorials/progressive-delivery/) (the Flagger/Argo gate); set `false` to drop the gate Service and endpoint. |
| `rollbackStore.backend` | `none` | `pvc` (RWX) or `s3` to enable [`spec.rollbackOnFailure`](/usage/rollback/); the two are mutually exclusive. |
| `rollbackStore.s3.sse` | `s3` | At-rest encryption for the S3 store (it holds rendered Secret data): `s3` (SSE-S3), `kms` (+`sseKmsKeyId`), or `none`. See [encryption at rest](/usage/rollback/#encryption-at-rest). |
| `networkPolicy.enabled` | `false` | `true` to fence the controller and the unauthenticated gate. |
| `metrics.serviceMonitor.enabled` | `false` | `true` if you scrape with the Prometheus operator. |
| `metrics.prometheusRule.enabled` | `false` | `true` for the bundled [alerts](/installation/operations/#alerts). |
| `serviceAccount.annotations` | `{}` | An IRSA role ARN on EKS so the S3 store uses an IAM role. |
| `namespace.create` | `false` | `true` to have the chart create the install namespace with Pod-Security labels. |
| `resources` | requests = limits | Raise for very large or very busy releases. |
Every option is set the same way — in your values file, applied with
`helm upgrade --install … -f values.yaml`. The [reference setups](#reference-setups)
above are complete, copy-pasteable examples.
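As a rough sketch of how a few rows of the table combine in one values file, consider the following. The nesting is an assumption inferred from the dotted value names, and the chosen numbers are illustrative only; check the chart's own `values.yaml` before relying on it.

```yaml
# values.yaml (illustrative; nesting inferred from the dotted names in the table)
replicas:
  min: 2                 # two or more replicas for HA
  max: 2                 # kept equal to min; max > min would also render an HPA + PDB
controller:
  leaderElect: true      # the default, required for HA
  defaultInterval: 5m    # hypothetical: tighter than the 10m default
networkPolicy:
  enabled: true          # fence the controller and the unauthenticated gate
metrics:
  serviceMonitor:
    enabled: true        # only if you scrape with the Prometheus operator
```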
## What the chart manages
You do **not** configure these — the chart wires them so the controller behaves
correctly out of the box:
- **Leader election and HA plumbing** — the lease, and the PDB/HPA when
`replicas.max > replicas.min`.
- **The admission webhook** — the server, its Service, the
`ValidatingWebhookConfiguration`, and the certificate (cert-manager `Certificate`
or the in-pod self-signed renewer, per `webhook.certMode`).
- **Endpoints** — metrics, health probes, and the gate, on their Services.
- **RBAC** — the ClusterRole/bindings the controller needs, including the
`impersonate` verb (by default the controller never applies as itself).
- **A hardened pod** — non-root, read-only root filesystem, dropped capabilities,
seccomp `RuntimeDefault` (see [pod security context](#pod-security-context)).
- **Per-tenant impersonation** — every apply runs as the StageSet's
`spec.serviceAccountName`.
## Controller flags
The chart sets the controller's command-line flags from your Helm values and its
own defaults — you never pass them directly. For the exhaustive per-flag list with
defaults, see the [Configuration reference](/installation/configuration/), which
also notes which Helm value drives each one.
---
# Tutorials
Source: https://stageset.projects.metio.wtf/tutorials/
End-to-end walkthroughs that stitch several pieces together. Where the
[usage](/usage/) pages each cover one feature in isolation, these follow a whole
task from start to finish.
---
# From Jsonnet to a gated rollout
Source: https://stageset.projects.metio.wtf/tutorials/jsonnet-to-rollout/
This tutorial follows a complete delivery: write [Kubernetes](https://kubernetes.io/docs/)
manifests in [Jsonnet](https://jsonnet.org/) and publish the source through
[Flux](https://fluxcd.io/); [JaaS](https://jaas.projects.metio.wtf/) renders it into a
Flux `ExternalArtifact`, and a StageSet rolls it out with a readiness gate.
The chain is:
```text
Jsonnet in Git/OCI/Bucket → JaaS (JsonnetSnippet) → ExternalArtifact → StageSet
```
This tutorial renders *Jsonnet*, so it goes through JaaS: JaaS turns the Jsonnet
into an `ExternalArtifact` the stage consumes. (If your manifests were already plain
YAML, a stage could read a `GitRepository`/`OCIRepository`/`Bucket` directly — see
[Stage sources](/tutorials/flux-sources/). The renderer is here because the input is
Jsonnet, not because StageSet can't read Git.)
## Prerequisites
- Flux installed (with the `ExternalArtifact` API — Flux ≥ v2.7.0).
- [JaaS](https://jaas.projects.metio.wtf/) installed in operator mode.
- StageSet installed (see [Installation](/installation/kubernetes/)).
- An `apps` namespace, and a `web-deployer` `ServiceAccount` in it whose RBAC can
apply the workload (the StageSet impersonates it):
```shell
kubectl create namespace apps
kubectl -n apps create serviceaccount web-deployer
# bind web-deployer to a Role/ClusterRole that can manage Deployments and
# Services in the apps namespace — see /usage/multi-cluster/ for the tenancy model
```
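One possible binding for the comment above, as a minimal sketch: it assumes the built-in `edit` ClusterRole is acceptable for this tutorial, which is broader than Deployments and Services alone. Swap in a narrower Role if your tenancy model calls for it.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: web-deployer
  namespace: apps
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit            # assumption: built-in role, scoped to apps by this RoleBinding
subjects:
  - kind: ServiceAccount
    name: web-deployer
    namespace: apps
```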
## 1. Write the manifests in Jsonnet
A small web app, parameterized as a Jsonnet top-level function so the same source
renders for any environment. Commit this as `jsonnet/main.jsonnet` in a Git repo:
```jsonnet
// jsonnet/main.jsonnet
function(name='web', image='registry.internal/web:latest', replicas='2') {
  apiVersion: 'v1',
  kind: 'List',
  items: [
    {
      apiVersion: 'apps/v1',
      kind: 'Deployment',
      metadata: { name: name },
      spec: {
        replicas: std.parseInt(replicas),
        selector: { matchLabels: { app: name } },
        template: {
          metadata: { labels: { app: name } },
          spec: { containers: [{ name: name, image: image }] },
        },
      },
    },
    {
      apiVersion: 'v1',
      kind: 'Service',
      metadata: { name: name },
      spec: { selector: { app: name }, ports: [{ port: 80, targetPort: 8080 }] },
    },
  ],
}
```
Rendering a `kind: List` keeps several resources in one document — both the
kustomize build the controller runs and `kubectl` flatten it transparently.
## 2. Publish the source through Flux
Point a Flux `GitRepository` at the repo so the cluster has the Jsonnet:
```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: web-manifests
  namespace: apps
spec:
  interval: 1m
  url: https://github.com/acme/web-manifests
  ref:
    branch: main
```
Apply it and wait for the source to sync:
```shell
kubectl apply -f gitrepository.yaml
kubectl -n apps wait --for=condition=Ready gitrepository/web-manifests
```
## 3. Render with JaaS
A `JsonnetSnippet` reads the Jsonnet from that source, passes the parameters as
top-level arguments, and publishes the rendered result as an `ExternalArtifact`:
```yaml
apiVersion: jaas.metio.wtf/v1
kind: JsonnetSnippet
metadata:
name: web
namespace: apps
spec:
sourceRef:
kind: GitRepository
name: web-manifests
path: ./jsonnet
entryFile: main.jsonnet
tlas: # top-level args → the function() parameters
name: ["web"]
image: ["registry.internal/web:2.1.0"]
replicas: ["3"]
```
Apply it; JaaS then publishes an `ExternalArtifact` named `web` in the `apps`
namespace. Confirm it went Ready:
```shell
kubectl apply -f jsonnetsnippet.yaml
kubectl -n apps get externalartifact web
```
## 4. Roll it out with StageSet
Reference the `JsonnetSnippet` as the stage source — StageSet resolves the
producer to its `ExternalArtifact` — and gate the stage on the Deployment becoming
available:
```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: web
  namespace: apps
spec:
  serviceAccountName: web-deployer # applies are impersonated as this SA
  stages:
    - name: web
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: web
      readyChecks:
        checks:
          - apiVersion: apps/v1
            kind: Deployment
            name: web
```
Apply it, preview the change before it lands, then watch it roll out:
```shell
kubectl apply -f stageset.yaml
stagesetctl diff web -n apps # preview against live cluster state
stagesetctl get web -n apps # per-stage progress
```
## 5. Ship a change
Edit `jsonnet/main.jsonnet` (or bump the `image` TLA on the snippet) and commit.
Flux pulls the new commit, JaaS re-renders and republishes the `ExternalArtifact`,
and StageSet — watching the producer — reconciles the new revision through the
same gate. No StageSet edit required.
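For the TLA route, the only edit is on the snippet. A minimal fragment, assuming a hypothetical next tag `2.2.0`:

```yaml
# jsonnetsnippet.yaml (fragment): only the image TLA changes
spec:
  tlas:
    image: ["registry.internal/web:2.2.0"]   # hypothetical next tag
```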
### No labels or annotations needed
You do **not** annotate or label anything to make this chain fire. The linkage is
the `sourceRef` itself: the controller watches the source *kinds* (`ExternalArtifact`,
`GitRepository`, `OCIRepository`, `Bucket`, and producers like `JsonnetSnippet`) and,
when one changes, maps it back to every StageSet whose `sourceRef` points at it — then
reconciles those. JaaS works the same way for a snippet's own `sourceRef` and
library references. Discovery is automatic; you only declare the references.
## Versioning the rollout
To gate one-time [migrations](/usage/versioned-migrations/) on a release boundary,
declare the version. The simplest is to pin it on the StageSet, bumped alongside the
image:
```yaml
spec:
  version:
    value: "2.1.0"
  migrations:
    - name: backfill-2-0
      to: "2.0.0" # runs once when the deployed version crosses 2.0.0
      stage: web
      actions:
        - name: backfill
          job:
            sourceRef:
              name: web-migrations
  stages:
    - name: web
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: web
```
### Let the version travel with the rendered manifests
Pinning works, but the cleaner pattern is to let the version ride *inside* the
manifests the snippet renders — so a single value flows from your CI all the way to
the rollout gate. Feed the version into the snippet and stamp it onto the standard
`app.kubernetes.io/version` label (and the image tag, from the same value):
```jsonnet
// web.jsonnet
local version = std.extVar('version'); // supplied by JaaS extVars / your CI
{
  apiVersion: 'apps/v1',
  kind: 'Deployment',
  metadata: {
    name: 'web',
    labels: { 'app.kubernetes.io/version': version }, // ← the version, in the manifest
  },
  spec: {
    template: {
      metadata: { labels: { 'app.kubernetes.io/version': version } },
      spec: { containers: [{ name: 'web', image: 'registry.example/web:' + version }] },
    },
  },
}
```
Then point `version.fromObject` at that object and drop the inline `value` — the
controller reads the label off the rendered `Deployment`:
```yaml
spec:
  version:
    fromObject:
      stage: web
      kind: Deployment
      name: web
      # defaults to the app.kubernetes.io/version label
  migrations:
    - name: backfill-2-0
      to: "2.0.0"
      stage: web
      actions:
        - name: backfill
          job:
            sourceRef:
              name: web-migrations
  stages:
    - name: web
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: web
```
Now the version has exactly one source of truth — the value your pipeline feeds the
snippet — and it shows up in the image tag, the version label, *and* the migration
gate together. The same `fromObject` works for a `GitRepository`/`OCIRepository`
source too; only a source that ships a dedicated file wants
[`version.fromArtifact`](/usage/versioned-migrations/#from-a-file-in-the-artifact--versionfromartifact)
instead. See [versioned migrations](/usage/versioned-migrations/) for all three.
## Next
From here, add more [stages](/usage/stages-and-sources/), pre/post
[actions](/usage/actions/), or [update windows](/usage/update-windows/) to turn
this single rollout into a gated, multi-stage release. To parameterize per
environment, see [Parameters](/tutorials/parameters/).
---
# Parameterizing a rollout
Source: https://stageset.projects.metio.wtf/tutorials/parameters/
A rollout takes parameters at two distinct layers, which serve different purposes:
- **Render-time parameters (JaaS).** Change *what gets rendered*. The Jsonnet
computes its output from top-level arguments (`tlas`) and external variables
(`externalVariables`). Different values produce a different `ExternalArtifact`.
- **Delivery-time parameters (StageSet `postBuild`).** Inject values *into
already-rendered manifests*, per stage, by string substitution — the same
mechanism Flux's `kustomize-controller` uses.
Use render-time parameters for structural logic; use delivery-time parameters to
stamp environment-specific values onto a shared artifact.
## Render-time: JaaS TLAs and external variables
Top-level arguments map to a Jsonnet `function(...)`:
```jsonnet
// main.jsonnet
function(name='web', replicas='2')
{ apiVersion: 'apps/v1', kind: 'Deployment', metadata: { name: name },
spec: { replicas: std.parseInt(replicas) /* … */ } }
```
```yaml
apiVersion: jaas.metio.wtf/v1
kind: JsonnetSnippet
metadata:
  name: web
  namespace: apps
spec:
  sourceRef: { kind: GitRepository, name: web-manifests, path: ./jsonnet }
  tlas: # → function(name, replicas)
    name: ["web"]
    replicas: ["3"]
  externalVariables: # → std.extVar('environment')
    environment: "production"
```
`tlas` is a map of name → list of values (a single-element list for a scalar
argument; multiple values become a JSON array). `externalVariables` are plain
strings read with `std.extVar`.
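To illustrate the list rule, here is a small fragment with one scalar and one array argument. The `hosts` argument is hypothetical; the Jsonnet function would need a matching `hosts` parameter for it to be accepted.

```yaml
spec:
  tlas:
    name: ["web"]                       # single element: passed as a scalar
    hosts: ["a.example", "b.example"]   # hypothetical argument: passed as a JSON array
```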
## Delivery-time: StageSet postBuild substitution
When the rendered manifests carry `${var}` placeholders, a stage substitutes them
at apply time — from inline values, ConfigMaps, and Secrets:
```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: web
  namespace: apps
spec:
  stages:
    - name: web
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: web
      postBuild:
        substitute:
          cluster_name: prod-eu
        substituteFrom:
          - kind: ConfigMap
            name: cluster-vars
          - kind: Secret
            name: cluster-secrets
            optional: true
```
A manifest field like `value: "${cluster_name}"` becomes `value: "prod-eu"` for
this stage.
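For context, the two sides of that substitution might look like this. Both objects are illustrative: `web-settings` and the `region` variable are hypothetical, and the `cluster-vars` layout assumes the Flux convention in which each `data` key is a variable name.

```yaml
# A rendered manifest inside the artifact, carrying placeholders
apiVersion: v1
kind: ConfigMap
metadata:
  name: web-settings             # hypothetical object
data:
  cluster: "${cluster_name}"     # filled from the inline substitute map
  region: "${region}"            # hypothetical variable, looked up in cluster-vars
---
# The ConfigMap named under substituteFrom (assumed layout: one key per variable)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-vars
  namespace: apps
data:
  region: eu-west-1
```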
## Reusing one artifact across environments
The two layers combine into a common pattern: render an environment-*agnostic*
artifact once with JaaS, then have several StageSets — one per environment —
consume that same artifact and stamp their own values with `postBuild`:
```yaml
# staging
spec:
  stages:
    - name: web
      sourceRef: { apiVersion: jaas.metio.wtf/v1, kind: JsonnetSnippet, name: web }
      postBuild:
        substituteFrom:
          - { kind: ConfigMap, name: staging-vars }
---
# production (same artifact, different values)
spec:
  stages:
    - name: web
      sourceRef: { apiVersion: jaas.metio.wtf/v1, kind: JsonnetSnippet, name: web }
      postBuild:
        substituteFrom:
          - { kind: ConfigMap, name: production-vars }
```
One render, many environments — each StageSet bounded by its own
[ServiceAccount](/usage/multi-cluster/) and gated by its own
[actions](/usage/actions/) and [update windows](/usage/update-windows/).
---
# Progressive delivery
Source: https://stageset.projects.metio.wtf/tutorials/progressive-delivery/
`StageSet` integrates with both progressive-delivery controllers:
[Flagger](https://flagger.app/) and
[Argo Rollouts](https://argoproj.github.io/argo-rollouts/). The controller exposes
a read-only gate endpoint and a readiness gauge so either one can hold a promotion
until a `StageSet` stage is healthy; ready checks let a stage wait on a Rollout in
return. Pick the section for your controller below — see also
[StageSet vs Argo Rollouts](/comparisons/argo-rollouts/).
## The gate contract
The gate endpoint backs the Flagger integration and the Argo Rollouts JSON-metric
option.
```text
GET /gate/{namespace}/{stageset}/{stage}
200 — the stage is Ready at the currently pinned revision
403 — the stage is not Ready (or not found / not gateable)
```
It is served on `--gate-bind-address` (default `:8082`) and exposed by the chart's
`stageset-controller-gate` Service (`gate.enabled`, on by default). The endpoint is
**unauthenticated and read-only**, so fence it with a `NetworkPolicy`
([production](/installation/production/#network-policy)) to admit only your
delivery controller.
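If you write the policy by hand rather than using the chart's `networkPolicy.enabled`, a sketch could look like the following. The pod label and the `flagger-system` namespace are assumptions, not values taken from the chart.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gate-from-flagger          # hypothetical name
  namespace: stageset-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: stageset-controller   # assumed pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: flagger-system   # assumed namespace
      ports:
        - protocol: TCP
          port: 8082                     # the default gate port
```

Note that once a policy selects the controller pods, all other ingress to them is denied. A hand-written policy therefore also has to admit the admission webhook and metrics traffic, which this sketch leaves out.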
## Flagger
Add a `confirm-promotion` (or `confirm-rollout`) webhook to a Flagger `Canary`
pointing at the gate. Flagger blocks the promotion until the gate returns `200`:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web
  namespace: apps
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  analysis:
    interval: 1m
    threshold: 5
    stepWeight: 10
    maxWeight: 50
    webhooks:
      - name: stageset-stage-ready
        type: confirm-promotion
        # gate this canary's promotion on a StageSet stage being Ready
        url: http://stageset-controller-gate.stageset-system:8082/gate/apps/web/web
```
This is independent of the Flagger *strategy*: the same webhook gates a weighted
**canary**, an **A/B test** (header/cookie routing), or a **blue-green** promotion
— the gate only answers "is this stage Ready," and Flagger decides what to do with
that answer.
This coordinates two moving parts: Flagger shifts traffic to a new version only once
a StageSet stage that applied the supporting config (a CRD, a migration, a sibling
component) reports Ready.
## Argo Rollouts
Argo Rollouts gates on **analysis metrics** (a query that returns a value to
compare) rather than a webhook's HTTP status, so the controller meets it on its own
terms in two ways.
### Gate on the readiness gauge (recommended)
The controller exports `stageset_stage_ready{namespace,stageset,stage}` (`1` when
the stage is Ready, `0` otherwise). Argo's **Prometheus** metric provider gates on
it directly — no gate endpoint, no Job:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: stageset-stage-ready
  namespace: apps
spec:
  metrics:
    - name: stage-ready
      successCondition: result == 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: max(stageset_stage_ready{namespace="apps",stageset="web",stage="web"})
```
### Gate on the JSON endpoint
The same gate endpoint also answers JSON when asked
(`Accept: application/json`), returning `{"ready": true, …}` with a `200` so Argo's
**web** metric can parse it (Argo treats a non-2xx as an error, so readiness has to
live in the body):
```yaml
spec:
  metrics:
    - name: stage-ready
      successCondition: "result.ready == true"
      provider:
        web:
          url: http://stageset-controller-gate.stageset-system:8082/gate/apps/web/web
          headers:
            - key: Accept
              value: application/json
          jsonPath: "{$}"
```
A **Job-based metric** (`curl -fsS …` against the gate, succeeding only on `200`)
is the fallback when the analysis has no Prometheus or web access.
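A minimal sketch of that Job-based fallback, using Argo's `job` metric provider. The template name and the `curlimages/curl` image are assumptions; any image that ships `curl` should do.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: stageset-stage-ready-job        # hypothetical name
  namespace: apps
spec:
  metrics:
    - name: stage-ready
      provider:
        job:
          spec:
            backoffLimit: 0             # a single attempt; the analysis decides on retries
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: gate
                    image: curlimages/curl:latest   # assumed image
                    command:
                      - curl
                      - -fsS            # -f turns the 403 "not Ready" answer into a non-zero exit
                      - http://stageset-controller-gate.stageset-system:8082/gate/apps/web/web
```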
## The reverse direction: gate a StageSet on a Rollout
The coordination also works the other way. Because
[ready checks](/usage/ready-checks/) accept CEL, a StageSet stage can wait on an
Argo `Rollout` finishing its own progressive rollout before the next stage runs:
```yaml
readyChecks:
  exprs:
    - apiVersion: argoproj.io/v1alpha1
      kind: Rollout
      current: "status.phase == 'Healthy'"
      inProgress: "status.phase in ['Progressing', 'Paused']"
      failed: "status.phase == 'Degraded'"
```
So StageSet can gate Argo (via the gauge/gate) and Argo's outcome can gate
StageSet (via ready checks) — pick whichever direction your release needs.
---
# Quickstart
Source: https://stageset.projects.metio.wtf/tutorials/quickstart/
This tutorial takes you from an empty cluster to one running StageSet. The path
is the shortest one — a single stage pointing directly at a Flux
`GitRepository` that already holds plain manifests. No Jsonnet, no migrations,
no optional knobs.
## Prerequisites
- A [Kubernetes](https://kubernetes.io/docs/) cluster with `kubectl` configured
against it.
- `helm` 3.x.
- [Flux](https://fluxcd.io/) **v2.7.0 or newer** — the `ExternalArtifact` CRD a
stage resolves to lands in that version. See
[Install on Kubernetes](/installation/kubernetes/#prerequisites) for the full
prerequisites.
## Step 1 — Install the controller
```shell
helm upgrade --install stageset-controller \
oci://ghcr.io/metio/helm-charts/stageset-controller \
--namespace stageset-system --create-namespace \
--wait --timeout 5m
```
See [Install on Kubernetes](/installation/kubernetes/) for the full list of chart
values (HA replicas, rollback store, webhook TLS mode, and so on).
Verify the controller is running:
```shell
kubectl -n stageset-system get deploy stageset-controller
# NAME READY UP-TO-DATE AVAILABLE AGE
# stageset-controller 1/1 1 1 30s
```
## Step 2 — Provide a source
A stage reads from a Flux source. The quickest path is a `GitRepository`
pointing at a repo that contains plain Kubernetes manifests:
```shell
cat <= 3"
timeout: 5m
```
### `patch`
Patch an existing object — flip a feature flag, scale something, annotate. `type`
is `merge` (default) for a strategic-merge patch, or `json6902` for a JSON Patch:
```yaml
- name: enable-traffic
  patch:
    target:
      apiVersion: v1
      kind: Service
      name: web
    type: merge # default; or json6902
    patch: |
      { "spec": { "selector": { "release": "green" } } }
```
### `delete`
Remove an existing object; a missing object counts as success.
```yaml
- name: drop-old-job
  delete:
    target:
      apiVersion: batch/v1
      kind: Job
      name: legacy-migration
```
### `apply`
Apply transient, rollout-scoped manifests that are **not** inventory-tracked and
are never pruned — a maintenance page, a one-shot canary, a temporary config. With
`wait: true` the action blocks until the applied objects report Ready (kstatus),
bounded by the action `timeout`, so a following `patch` can repoint traffic only
once the resource is serving.
Because the applied objects are never pruned by the inventory diff, removing them is
up to you. To stand a resource up only for the duration of a rollout, pair an `apply`
in `pre` with a matching `delete` in `post`, and guard against a mid-run crash with an
`onFailure` delete:
```yaml
actions:
  pre:
    - name: stand-up-maintenance-page
      apply:
        sourceRef:
          name: maintenance-page # an ExternalArtifact holding a Pod + Service
        wait: true # block until it is serving
  post:
    - name: tear-down-maintenance-page
      delete:
        target:
          apiVersion: v1
          kind: Pod
          name: maintenance-page
  onFailure:
    - name: tear-down-maintenance-page-on-failure
      delete:
        target:
          apiVersion: v1
          kind: Pod
          name: maintenance-page
```
The action ledger gates each step per pinned revision, so a retry or controller
restart never re-applies or re-deletes the resource for the same snapshot.
To run a `job` action only when the deployed version crosses a release boundary,
see [versioned migrations](/usage/versioned-migrations/).
---
# Conflict policies
Source: https://stageset.projects.metio.wtf/usage/conflict-policies/
Conflict policies decide what happens when an apply hits an immutable-field
conflict — a changed `clusterIP`, a `Job` pod template, a `StorageClass` field
that can't be updated in place. By default the controller fails the stage and
reports it, so nothing destructive happens by surprise. A policy opts specific
resources into automatic resolution.
## The three actions
- `Fail` — stop and report (the default; safest).
- `Recreate` — delete and re-create the object to get past an immutable-field
change.
- `KeepExisting` — leave the live object as-is and move on.
## A default for the whole stage
```yaml
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app
      conflictPolicy:
        default: Fail # explicit; the safe default
```
The `force: true` shorthand on a stage is equivalent to
`conflictPolicy.default: Recreate`.
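Written out, the shorthand form would look roughly like this, assuming `force` sits directly on the stage as the sentence above describes:

```yaml
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app
      force: true   # same effect as conflictPolicy.default: Recreate
```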
## Per-resource rules
Rules let you recreate exactly the resources that need it while everything else
stays `Fail`. A rule's `target` is a partial selector: any field you omit matches
any value. Rules are evaluated in list order; the **first** rule whose target
matches wins, and an object matching no rule falls back to `default`.
```yaml
conflictPolicy:
  default: Fail
  rules:
    # a Job's pod template is immutable, so recreate it on change
    - target:
        apiVersion: batch/v1
        kind: Job
      action: Recreate
    # never fight an HPA over replica counts
    - target:
        kind: Deployment
        name: web
      action: KeepExisting
```
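Because the first matching rule wins, a narrow exception has to be listed before the broader rule it carves out of. A sketch, with a hypothetical Job name:

```yaml
conflictPolicy:
  default: Fail
  rules:
    # specific rule first: this one Job is left alone
    - target:
        apiVersion: batch/v1
        kind: Job
        name: seed-data          # hypothetical Job name
      action: KeepExisting
    # broader rule second: every other Job is recreated on conflict
    - target:
        apiVersion: batch/v1
        kind: Job
      action: Recreate
```

With the two rules swapped, the broad `Job` rule would match `seed-data` first and the exception would never apply.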
## Recreating storage
Recreating a `PersistentVolumeClaim` or `PersistentVolume` destroys data, so a
`Recreate` **rule** targeting one is refused unless you explicitly accept the loss:
```yaml
rules:
  - target:
      kind: PersistentVolumeClaim
      name: scratch
    action: Recreate
    allowDataLoss: true # required for PVC/PV Recreate, refused otherwise
```
Without `allowDataLoss: true`, a `Recreate` rule targeting a PVC/PV is rejected —
a guardrail against accidentally wiping a volume.
---
# Multi-cluster and tenancy
Source: https://stageset.projects.metio.wtf/usage/multi-cluster/
There are two ways to run the controller, and they map onto two different trust
models. Pick the one that matches your cluster:
- **Multi-tenant** — the controller holds no write access of its own and applies
every `StageSet` impersonating that `StageSet`'s `serviceAccountName`. Each
tenant's RBAC bounds what its releases can touch. This is the chart default.
- **Single-tenant** — the cluster has one operator, so per-tenant isolation buys
nothing. Run the controller under its own identity bound to `cluster-admin` and
skip impersonation entirely — the model Flux's `helm-controller` uses in its
default install.
The two sections below set each one up. The optional
[watch scoping](#scoping-the-controller-to-a-namespace-set) narrows *which*
namespaces a multi-tenant controller sees.
## Impersonation (multi-tenant)
In this mode the controller never applies your manifests as itself. Set `serviceAccountName`
and every operation for that `StageSet` — build, apply, prune, actions — is
performed impersonating that ServiceAccount. The `StageSet` can do exactly what the
SA's RBAC permits, and nothing more.
```yaml
spec:
  serviceAccountName: payments-deployer # all writes impersonate this SA
  stages:
    - name: app
      sourceRef:
        name: payments-app
```
Grant the SA only the rights that release needs:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-deployer
  namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - kind: ServiceAccount
    name: payments-deployer
    namespace: payments
```
This is the multi-tenancy model: isolation comes from each `StageSet` being bounded
by its tenant SA, not from the controller's own grant — by default the chart gives
the controller `impersonate` and read access, no blanket write. A `StageSet` with no
`serviceAccountName`, or one bound to a too-narrow SA, fails closed rather than
escalating.
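Where the built-in `edit` role is broader than a release needs, a dedicated Role is one way to tighten the grant. The sketch below is illustrative only: the Role name and the resource list are assumptions, and it does not attempt to cover anything the controller writes on the tenant's behalf beyond the workload itself. A grant that turns out too narrow fails closed, as described above.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-release            # hypothetical name
  namespace: payments
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-release
  namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: payments-release
subjects:
  - kind: ServiceAccount
    name: payments-deployer
    namespace: payments
```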
## Single-tenant cluster-admin
On a cluster with a single operator, per-`StageSet` impersonation is friction with
no payoff — there is no other tenant to isolate from. Run the controller the way
Flux's `helm-controller` runs by default: under its own ServiceAccount, bound to
the built-in `cluster-admin` ClusterRole. `StageSet`s then omit `serviceAccountName`
and apply as the controller, which can write any kind cluster-wide.
Turn it on with one Helm value:
```yaml
rbac:
  clusterAdmin: true # bind the controller SA to cluster-admin
```
```bash
helm upgrade --install stageset-controller \
oci://ghcr.io/metio/helm-charts/stageset-controller \
-n stageset-system --create-namespace \
--set rbac.clusterAdmin=true
```
`StageSet`s then need nothing tenancy-related — they apply directly:
```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: platform
  namespace: stageset-system
spec:
  stages:
    - name: app
      sourceRef:
        name: platform-app # applied by the controller's cluster-admin identity
```
When `serviceAccountName` is unset and no `kubeConfig` is given, the controller
applies with its own client — so the `cluster-admin` binding is what lets those
`StageSet`s write. The trade-off: every `StageSet` on the cluster has full write
access, so this is for single-tenant clusters only. Leave `rbac.clusterAdmin` at its
default `false` and use [impersonation](#impersonation-multi-tenant) whenever more
than one team shares the cluster. The two models can be combined: a cluster-admin controller still
honors `serviceAccountName` on any `StageSet` that sets it, dropping to that SA's
rights for that release.
## Scoping the controller to a namespace set
By default the controller watches every namespace. To run one controller per
tenant-group instead — disjoint deployments that each see only their own
namespaces — set `controller.watchNamespaces`:
```yaml
controller:
  watchNamespaces:
    - team-a
    - team-b
```
This does two things together:
- **Cache scoping.** The manager's informers only observe `StageSet`s and sources
in the listed namespaces. Resources elsewhere never enter the cache, so the
controller cannot act on them even if RBAC would allow it.
- **RBAC pivot.** The chart stops binding the tenant ClusterRole cluster-wide and
instead renders one `RoleBinding` per listed namespace — defense in depth, so the
apiserver also refuses out-of-scope calls. (The cluster-scoped webhook-caBundle
grant stays a `ClusterRoleBinding`, since a `ValidatingWebhookConfiguration` is
not namespaced.)
Install several Helm releases of the chart with disjoint `watchNamespaces` lists to shard the cluster
across independent controller instances. Combine it with impersonation for the
tightest setup: each instance sees only its namespaces, and each `StageSet` is
bounded by its tenant SA.
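As an illustration, two such instances could be driven by two values files like these. The file names and the second pair of namespaces are hypothetical.

```yaml
# values-team-ab.yaml: first Helm release
controller:
  watchNamespaces:
    - team-a
    - team-b
---
# values-team-cd.yaml: second Helm release, disjoint from the first
controller:
  watchNamespaces:
    - team-c
    - team-d
```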
## Remote clusters
Point a `StageSet` at another cluster with `kubeConfig`, referencing a Secret that
holds a kubeconfig. Combined with `serviceAccountName`, the controller applies to
the remote cluster as the impersonated identity there.
```yaml
spec:
  serviceAccountName: payments-deployer
  kubeConfig:
    secretRef:
      name: prod-eu-kubeconfig
      # key defaults to "value" (the Flux convention); set it to override
  stages:
    - name: app
      sourceRef:
        name: payments-app
```
The Secret is read with the controller's own identity — connecting to the target
cluster is the controller's job — and the kubeconfig payload defaults to the
`value` key. A self-contained kubeconfig is required; `configMapRef`-style
cloud-provider auth is not supported.
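A Secret in that shape might look like the sketch below. The namespace, server address, and credential placeholders are assumptions for illustration; the only detail taken from the text above is that the payload lives under the `value` key and is self-contained.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prod-eu-kubeconfig
  namespace: payments              # assumed: the same namespace as the StageSet
type: Opaque
stringData:
  value: |
    apiVersion: v1
    kind: Config
    clusters:
      - name: prod-eu
        cluster:
          server: https://prod-eu.example.com:6443   # placeholder endpoint
          certificate-authority-data: <base64 CA bundle>
    users:
      - name: stageset
        user:
          token: <bearer token>
    contexts:
      - name: prod-eu
        context:
          cluster: prod-eu
          user: stageset
    current-context: prod-eu
```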
Cross-namespace `sourceRef` and `dependsOn` references can be disabled
cluster-wide with the controller's `--no-cross-namespace-refs` flag (set through
the `controller.noCrossNamespaceRefs` Helm value) when you want hard namespace
isolation.
---
# Producer-aware sources
Source: https://stageset.projects.metio.wtf/usage/producer-aware-sources/
[Stages and sources](/usage/stages-and-sources/#source-kinds) covers the two
direct routes — an `ExternalArtifact` (the default `sourceRef.kind`) or a Flux
`GitRepository`/`OCIRepository`/`Bucket`. The third option names the thing that
*produces* an artifact and lets the controller find it. This is useful when an
operator publishes an `ExternalArtifact` from a custom resource (for example
[JaaS](https://jaas.projects.metio.wtf/) rendering Jsonnet).
## Referencing a producer
Set `kind` (and `apiVersion`) to a producer resource, and the controller resolves
it to the `ExternalArtifact` that producer publishes — the one whose
`spec.sourceRef` back-references the producer (matched on group, kind, and name).
For example, a [JaaS](https://jaas.projects.metio.wtf/) `JsonnetSnippet`
renders Jsonnet and publishes an `ExternalArtifact`; reference the snippet and the
controller follows the link:
```yaml
spec:
  stages:
    - name: dashboards
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: grafana-dashboards
```
The controller also watches the common Flux source kinds (`GitRepository`,
`OCIRepository`, `Bucket`) so a stage re-reconciles when an upstream source
changes.
A producer can itself consume another producer first: a JaaS `JsonnetSnippet` can
render from the artifact another snippet publishes. That chaining happens on the
producer side — see
[chaining snippets](https://jaas.projects.metio.wtf/usage/snippet-sources/#chaining-snippets).
A stage references only the final producer and reads the `ExternalArtifact` it
publishes.
## Related projects
JOI, JaaS, and `StageSet` compose end to end:
- **[JOI](https://github.com/metio/jsonnet-oci-images)** publishes Jsonnet
libraries as single-layer OCI images (usable both as image-volume mounts and as
Flux `OCIRepository` sources).
- **[JaaS](https://jaas.projects.metio.wtf/)** evaluates Jsonnet — optionally
importing those JOI libraries — and publishes the rendered JSON as an
`ExternalArtifact`.
- **`StageSet`** references the `JsonnetSnippet` (or its artifact) and rolls the
result out in ordered, gated stages.
Each project is independently useful: a stage can read straight from a
`GitRepository`, `OCIRepository`, or `Bucket`, or from any `ExternalArtifact`
regardless of what produced it.
---
# Ready checks
Source: https://stageset.projects.metio.wtf/usage/ready-checks/
Ready checks decide when a stage is healthy enough to let the next stage start.
They are purely observational — the controller waits and reports, but takes no
action (active steps are [actions](/usage/actions/)).
By default, with no `readyChecks` block, the controller waits for **every** object
the stage applied to report ready via
[kstatus](https://github.com/kubernetes-sigs/cli-utils/tree/master/pkg/kstatus).
`readyChecks` lets you narrow that to specific objects (`checks`), add custom
health for resources kstatus doesn't understand (`exprs`, [CEL](https://github.com/google/cel-spec)),
bound the wait (`timeout`), or skip it entirely (`disableWait`). `checks` and
`exprs` may be set together.
## Explicit objects
Wait for named objects only — useful when a stage applies many objects but only a
few gate the next stage:
```yaml
spec:
  stages:
    - name: infrastructure
      sourceRef:
        name: platform
      readyChecks:
        timeout: 5m
        checks:
          - apiVersion: apiextensions.k8s.io/v1
            kind: CustomResourceDefinition
            name: ledgers.payments.example
          - apiVersion: apps/v1
            kind: Deployment
            name: ledger-operator
            namespace: platform-system
```
## Custom health with CEL
For custom resources kstatus doesn't understand, describe readiness with CEL
expressions. The shape matches `kustomize-controller`'s `healthCheckExprs`, so
expressions are portable.
```yaml
readyChecks:
  exprs:
    - apiVersion: db.example/v1
      kind: Database
      current: "status.phase == 'Running'"
      inProgress: "status.phase in ['Pending', 'Provisioning']"
      failed: "status.phase == 'Failed'"
```
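To make the three-expression contract concrete, the sketch below models how one object's status might be classified. It is illustrative only: plain Python predicates stand in for the CEL expressions, and the helper names (`phase`, `classify`) and the evaluation order (failed first, then current, then in progress) are assumptions, not documented controller behavior.

```python
from typing import Any, Callable, Dict

Predicate = Callable[[Dict[str, Any]], bool]


def phase(obj: Dict[str, Any]) -> Any:
    """Read status.phase, tolerating objects that have no status yet."""
    return obj.get("status", {}).get("phase")


# Python stand-ins for the three CEL expressions in the example above.
current: Predicate = lambda obj: phase(obj) == "Running"
in_progress: Predicate = lambda obj: phase(obj) in ("Pending", "Provisioning")
failed: Predicate = lambda obj: phase(obj) == "Failed"


def classify(obj: Dict[str, Any]) -> str:
    """Return a readiness verdict; the order of checks is an assumption."""
    if failed(obj):
        return "Failed"
    if current(obj):
        return "Current"
    if in_progress(obj):
        return "InProgress"
    return "Unknown"  # none of the expressions matched


print(classify({"status": {"phase": "Provisioning"}}))
```

An object with no `status` yet matches none of the expressions, which is why the sketch keeps an explicit `Unknown` outcome.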
## Opting out
To apply a stage without waiting for readiness (fire-and-forget), disable the
wait:
```yaml
readyChecks:
  disableWait: true
```
---
# Rollback
Source: https://stageset.projects.metio.wtf/usage/rollback/
When a run fails, the controller can restore the last successfully applied artifact
revisions instead of leaving you on a broken release. Rollback is opt-in and needs
somewhere to keep prior revisions.
## Enabling it
```yaml
spec:
  rollbackOnFailure: true
  stages:
    - name: app
      sourceRef:
        name: my-app
```
On a failed run the controller restores each stage's last-good artifact revision,
best-effort, and emits a `RolledBack` event. The coordinates it restores from are
recorded in `status.lastAppliedSnapshot`.
## The rollback store
Rollback needs the prior revision to still be fetchable, so the controller keeps a
copy in a **rollback store**. Configure one on the controller (cluster-wide), via
either a shared filesystem or S3:
```text
# filesystem (an RWX PersistentVolumeClaim)
--rollback-store-path=/var/lib/stageset/rollback
# or S3-compatible object storage
--rollback-store-s3-endpoint=s3.example.com
--rollback-store-s3-bucket=stageset-rollback
```
The two are mutually exclusive. With no store configured, rollback can only use a
prior revision the producer itself still retains; a dedicated store makes rollback
reliable across producer pruning.
### Encryption at rest
The store keeps each stage's rendered output, which includes any `Secret`'s data —
including [SOPS](https://github.com/getsops/sops)-decrypted values (see
[secrets encryption](/usage/encryption/)). Treat it as sensitive and keep it
encrypted at rest:
- **S3** encrypts by default. `--rollback-store-s3-sse` (chart:
`rollbackStore.s3.sse`) is `s3` (SSE-S3) out of the box; set `kms` with
`rollbackStore.s3.sseKmsKeyId` for SSE-KMS, or `none` only for a backend that
cannot honor an SSE header. A rejected SSE write is non-fatal — it warns via a
`RollbackStoreFailed` event and skips the store write; the rollout still
succeeds.
- **Filesystem** can't encrypt itself — back the PVC with an **encrypted volume**
(an encrypted `StorageClass`, LUKS, or cloud-disk encryption). The controller
logs a reminder at startup when the file store is enabled.
If a restore can't proceed because the previous revision is gone, the run fails
with the `PreviousRevisionUnavailable` reason (see its
[runbook](/runbooks/previousrevisionunavailable/)), and a store problem surfaces as
a `RollbackStoreFailed` event.
---
# Secrets encryption (SOPS)
Source: https://stageset.projects.metio.wtf/usage/encryption/
A stage's source can carry [SOPS](https://github.com/getsops/sops)-encrypted
files — typically a `Secret` whose values are encrypted — and the controller
decrypts them in memory, before building and applying the manifests. This mirrors
Flux's `kustomize-controller` decryption contract, so an existing SOPS-encrypted
repository works unchanged.
Set `spec.decryption` and point it at a Secret holding the keys:
```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: payments
  namespace: payments
spec:
  serviceAccountName: payments-deployer
  decryption:
    provider: sops                 # the only provider
    secretRef:
      name: sops-age               # a Secret in this namespace holding the age key
  stages:
    - name: app
      sourceRef:
        kind: GitRepository
        name: payments-config      # contains an encrypted secret.yaml
```
## Walkthrough — age
[age](https://age-encryption.org/) is the simplest key type and needs no external
service. Take a `Secret` from plaintext to a GitOps-safe rollout in four steps.
**1. Generate an age key.** The file holds the private key; the printed `age1…`
line is the public recipient to encrypt to.
```bash
age-keygen -o age.agekey
# public key: age1qz…
```
**2. Encrypt a Secret.** Encrypt only its values, so the file stays a valid
Kubernetes object, then commit `secret.enc.yaml` (never the plaintext):
```yaml
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: payments-db
  namespace: payments
stringData:
  password: s3cr3t-do-not-commit-plaintext
```
```bash
sops --encrypt --age age1qz… \
--encrypted-regex '^(data|stringData)$' \
secret.yaml > secret.enc.yaml
```
**3. Put the private key in the cluster** under a `.agekey` data entry. Store
`age.agekey` itself somewhere safe — it is the only thing that can decrypt the
Secret.
```bash
kubectl create secret generic sops-age \
--namespace payments \
--from-file=keys.agekey=age.agekey
```
**4. Decrypt on rollout.** Point a `StageSet` at the source holding
`secret.enc.yaml` and set `spec.decryption` (as in the example above). On reconcile
the controller fetches the source, decrypts every SOPS file in memory, builds, and
applies — so the cluster holds the plaintext `payments-db` Secret while Git only
ever held ciphertext. Grant the deployer ServiceAccount read access to the key
Secret (see [tenancy](#how-keys-are-read--tenancy) below).
## Pairing with JaaS-rendered manifests
A realistic app renders its config from Jsonnet with
[JaaS](https://jaas.projects.metio.wtf/) and keeps only its Secret encrypted. The
two compose cleanly because each owns one concern:
- **JaaS renders the non-secret manifests.** It evaluates Jsonnet server-side and
cannot hold secret values: SOPS ciphertext carries a MAC over the whole encrypted
document, so it can't be authored in Jsonnet — and routing plaintext secrets
through a render service is what you are avoiding.
- **The Secret stays SOPS-encrypted in Git**, as in the walkthrough.
- **The controller decrypts and orders both** under one `spec.decryption`:
```yaml
spec:
  serviceAccountName: payments-deployer
  decryption:
    provider: sops
    secretRef:
      name: sops-age
  stages:
    - name: secrets                # decrypt + apply the SOPS Secret first
      sourceRef:
        kind: GitRepository
        name: payments-secrets
    - name: app                    # then the JaaS-rendered app that mounts it
      sourceRef:
        apiVersion: jaas.metio.wtf/v1
        kind: JsonnetSnippet
        name: payments-app
```
The `secrets` stage runs first; only once the `Secret` is applied does the `app`
stage roll out the rendered manifests that mount it. The encrypted Secret and the
rendered config live in separate sources, so the Jsonnet author never touches secret
material.
## The fields
- **`provider`** — the decryption backend. Only `sops` is supported.
- **`secretRef.name`** — a Secret in the `StageSet`'s namespace holding the keys,
using the SOPS conventions: age private keys under data entries ending in
`.agekey`, armored PGP private keys under `.asc`. Optional — omit it for a
[cloud-KMS-only](#cloud-kms) setup.
## How keys are read — tenancy
The key Secret is read in the `StageSet`'s namespace **under its
`serviceAccountName`**, exactly like the manifests it applies. A tenant can only
decrypt with key material its own ServiceAccount is allowed to read, so a key in one
namespace is never reachable from another. Grant the deployer SA `get` on the key
Secret:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-deployer-sops
  namespace: payments
rules:
  - apiGroups: [""]
    resources: [secrets]
    resourceNames: [sops-age]
    verbs: [get]
```
In a [single-tenant cluster-admin](/usage/multi-cluster/#single-tenant-cluster-admin)
install (no `serviceAccountName`), the controller reads the key Secret under its
own identity instead.
## Decryption and the rollback store
Decrypted bytes exist only in memory on the apply path. The one place rendered
output is persisted is the optional [rollback store](/usage/rollback/), which
should be **encrypted at rest** (S3 SSE is on by default; the file store needs an
encrypted volume) so that a decrypted `Secret` never lands in plaintext on disk.
See [encryption at rest](/usage/rollback/#encryption-at-rest).
A rollback re-fetches the previous source and **runs decryption again** rather than
restoring plaintext, so the key Secret must still exist when a rollback fires. If
the key was rotated or deleted in the meantime, the rollback **fails closed** with
`PreviousRevisionUnavailable` instead of applying a stale or unreadable Secret — an
encrypted store cannot avoid this, and it is the safe failure direction.
## Cloud KMS
SOPS files encrypted with a cloud KMS key (AWS KMS, GCP KMS, Azure Key Vault, or
HashiCorp Vault) decrypt through the **controller's ambient credentials** — e.g. an
IRSA role on EKS, wired via `serviceAccount.annotations`. No in-cluster key Secret
is needed, so `secretRef` may be omitted for a KMS-only `StageSet`:
```yaml
spec:
  decryption:
    provider: sops   # secretRef omitted; KMS uses the controller's identity
```
One consequence to weigh in a multi-tenant cluster: unlike age (read under the
tenant SA), **cloud KMS uses the controller's identity**, so any `StageSet` can
decrypt a file encrypted with a KMS key the controller's role can access. This
matches Flux's `kustomize-controller`. Scope the controller's KMS grant
accordingly, or use age keys for hard per-tenant isolation.
## What's supported
- **age** keys via `secretRef` — read under the tenant SA. The resource-level
pattern (`--encrypted-regex '^(data|stringData)$'`) is the tested path.
- **PGP** keys via `secretRef` (`.asc` entries) — read under the tenant SA, pure
Go, no `gpg` binary or keyring needed. See [PGP keys](#pgp-keys).
- **Cloud KMS** (AWS/GCP/Azure/Vault) via the controller's ambient credentials.
- **Encrypted files feeding a `secretGenerator`** — an encrypted `.env` (or other
file) referenced by a kustomize `secretGenerator` is decrypted before the build,
so the generated `Secret` carries the plaintext.
- A file with no SOPS metadata passes through untouched, so encrypted and plain
manifests can sit side by side in one source.
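The pass-through rule in the last bullet can be pictured with a small sketch. SOPS records its metadata under a top-level `sops` key in YAML and JSON documents; how the controller itself detects encrypted files is not described here, so treating that key as the signal is an assumption, as are the function names `has_sops_metadata` and `partition`.

```python
from typing import Any, Dict, List, Tuple

Document = Dict[str, Any]


def has_sops_metadata(doc: Document) -> bool:
    """True when the document carries a top-level `sops` metadata mapping."""
    return isinstance(doc.get("sops"), dict)


def partition(docs: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Split parsed documents into (to decrypt, to pass through untouched)."""
    encrypted = [d for d in docs if has_sops_metadata(d)]
    plain = [d for d in docs if not has_sops_metadata(d)]
    return encrypted, plain


docs = [
    {"kind": "Secret", "stringData": {"password": "ENC[...]"}, "sops": {"mac": "ENC[...]"}},
    {"kind": "ConfigMap", "data": {"mode": "fast"}},
]
encrypted, plain = partition(docs)
print(len(encrypted), len(plain))
```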
## PGP keys
PGP works **tenant-scoped**, like age: put one or more armored private keys in the
`secretRef` Secret under data entries suffixed `.asc`. The data key is decrypted in
pure Go (`ProtonMail/go-crypto`) directly from those keys — **no `gpg` binary, no
GnuPG keyring, and no `GNUPGHOME`** — and the keys are read under the `StageSet`'s
`serviceAccountName`, so a tenant can only use material its ServiceAccount can read.
```bash
# export the armored private key and load it into the key Secret
gpg --export-secret-keys --armor 0xYOURFINGERPRINT > key.asc
kubectl create secret generic sops-keys \
--namespace payments \
--from-file=pgp.asc=key.asc
```
```yaml
spec:
  decryption:
    provider: sops
    secretRef:
      name: sops-keys   # holds the *.asc private key(s)
```
One Secret can carry both age (`*.agekey`) and PGP (`*.asc`) keys; the right one is
used per file. For a fresh setup, age is simpler and the recommended default, but an
existing PGP-encrypted repository needs no migration.
---
# Stages and sources
Source: https://stageset.projects.metio.wtf/usage/stages-and-sources/
A `StageSet` is an ordered list of stages. Each stage resolves a
[Flux](https://fluxcd.io/) source — a `GitRepository`, `OCIRepository`, `Bucket`,
or an `ExternalArtifact` (the default) — applies its manifests, waits for them to
become healthy, and only then lets the next stage start.
## One stage
The minimum is one stage pointing at one artifact in the same namespace:
```yaml
apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata:
  name: my-app
  namespace: default
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app   # an ExternalArtifact
```
`sourceRef.kind` defaults to `ExternalArtifact`, so the common case is a single
line. The controller fetches the artifact, applies every manifest in it, and marks
the stage `Ready` once the applied objects report healthy.
## Source kinds
A `sourceRef` resolves to a Flux artifact in one of three ways. Point it at
whichever you already have:
```yaml
# 1. an ExternalArtifact (the default; kind omitted)
sourceRef:
  name: my-app
# 2. a classic Flux source, consumed directly
sourceRef:
  kind: GitRepository   # or OCIRepository, or Bucket
  name: my-app-manifests
# 3. a producer that publishes an ExternalArtifact (resolved via its back-pointer)
sourceRef:
  apiVersion: jaas.metio.wtf/v1
  kind: JsonnetSnippet
  name: my-app
```
`GitRepository`, `OCIRepository`, and `Bucket` carry the same `status.artifact`
contract as `ExternalArtifact`, so the controller reads them directly — no producer
in between. A stage can apply manifests straight from a Git repo or an OCI artifact,
like Flux's own `kustomize-controller`. For the producer case (for example
rendering Jsonnet with [JaaS](https://jaas.projects.metio.wtf/)), see
[producer-aware sources](/usage/producer-aware-sources/).
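The three resolution paths can be summarized as a small decision function. This is a sketch of the rules stated in this section, not the controller's code; `classify_source_ref` and the returned labels are hypothetical names chosen for illustration.

```python
from typing import Dict

# The classic Flux source kinds that a stage reads directly.
FLUX_SOURCE_KINDS = {"GitRepository", "OCIRepository", "Bucket"}


def classify_source_ref(source_ref: Dict[str, str]) -> str:
    """Decide which of the three resolution paths a sourceRef takes."""
    kind = source_ref.get("kind", "ExternalArtifact")  # kind defaults when omitted
    if kind == "ExternalArtifact":
        return "external-artifact"
    if kind in FLUX_SOURCE_KINDS:
        return "flux-source"
    # Anything else is treated as a producer that publishes an ExternalArtifact.
    return "producer"


print(classify_source_ref({"name": "my-app"}))
```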
## Ordered stages
Add more stages and they run top to bottom — each one waits for the previous to be
`Ready`:
```yaml
spec:
  stages:
    - name: crds          # 1 ── install the CRDs first
      sourceRef:
        name: platform-crds
    - name: operator      # 2 ── then the operator that needs them
      sourceRef:
        name: platform-operator
    - name: workloads     # 3 ── then the workloads it manages
      sourceRef:
        name: team-workloads
```
This is the core of a `StageSet`: `operator` is never applied until `crds` is
healthy, so the operator never crash-loops waiting for a CRD that isn't there yet.
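The ordering guarantee can be modeled in a few lines. This is a simplified sketch: the real controller reconciles continuously and retries, whereas the hypothetical `run_stages` below makes a single pass and simply stops at the first stage that is not ready. The `apply` and `is_ready` callbacks are placeholders for whatever applies and health-checks a stage.

```python
from typing import Callable, List


def run_stages(
    stages: List[str],
    apply: Callable[[str], None],
    is_ready: Callable[[str], bool],
) -> List[str]:
    """Apply stages top to bottom; never start a stage before the previous is ready."""
    applied: List[str] = []
    for name in stages:
        apply(name)
        applied.append(name)
        if not is_ready(name):
            break  # later stages wait until this one reports ready
    return applied


# Example: the operator is applied but not yet healthy, so workloads never start.
applied = run_stages(
    ["crds", "operator", "workloads"],
    apply=lambda name: None,
    is_ready=lambda name: name != "operator",
)
print(applied)
```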
## Shaping a stage's manifests
A stage can build from a sub-path of the artifact, customize with patches, and
substitute variables — the [kustomize](https://kubectl.docs.kubernetes.io/)-style
surface:
```yaml
spec:
  stages:
    - name: app
      sourceRef:
        name: my-app
      path: ./overlays/production   # build a sub-path of the artifact
      prune: true                   # GC objects that leave this stage (default)
      patches:
        - patch: |
            - op: replace
              path: /spec/replicas
              value: 6
          target:
            kind: Deployment
            name: web
      postBuild:
        substitute:
          cluster_name: prod-eu
        substituteFrom:
          - kind: ConfigMap
            name: cluster-vars
          - kind: Secret
            name: cluster-secrets
            optional: true
```
- **`path`** builds from a directory inside the artifact (default `./`).
- **`prune`** (default `true`) garbage-collects objects that fall out of the stage
between reconciles, tracked precisely via the stage's
[`StageInventory`](/api/stageinventory/).
- **`patches`** are strategic-merge or JSON6902 patches applied after the build.
- **`postBuild`** substitutes `${var}` references from inline values, ConfigMaps,
and Secrets at delivery time — see [parameterizing a rollout](/tutorials/parameters/)
for the full render-time-vs-delivery-time treatment.
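As a rough picture of what `postBuild` substitution does to a rendered manifest, the sketch below replaces `${var}` references from a single dictionary of values. It is a deliberate simplification: how inline values, ConfigMaps, and Secrets are merged, and what happens to a reference with no value, are not specified in this section, so the single `values` mapping and the choice to leave unknown references untouched are assumptions.

```python
import re
from typing import Dict

# Matches ${name} where name is a simple identifier.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute(manifest: str, values: Dict[str, str]) -> str:
    """Replace ${var} references; unknown variables are left as written."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        return values.get(name, match.group(0))

    return _VAR.sub(replace, manifest)


rendered = "cluster: ${cluster_name}\nregion: ${region}\n"
print(substitute(rendered, {"cluster_name": "prod-eu"}))
```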
From here, layer on [actions](/usage/actions/) to gate the stage, or
[ready checks](/usage/ready-checks/) to define what "healthy" means.
---
# Update windows
Source: https://stageset.projects.metio.wtf/usage/update-windows/
Update windows gate *when* new artifact revisions roll out, without pausing
reconciliation. Drift correction keeps running; only the rollout of a *new*
revision is held until a window allows it.
## Deny a recurring window
Freeze rollouts during business hours:
```yaml
spec:
stages:
- name: app
sourceRef:
name: my-app
updateWindows:
- type: Deny
schedule: "0 9 * * MON-FRI" # 5-field cron: start of the window
duration: 8h
timeZone: Europe/Berlin
```
A new revision that arrives inside the window is held; `status.pendingUpdate`
records what is waiting and `nextWindowOpens` when it will ship. The controller
emits an `UpdateDeferred` event and increments `stageset_update_deferred_total`.
## Allow-list windows
If any `Allow` window exists, rollouts happen **only** inside an active Allow with
no active Deny — `Deny` always wins. This expresses "only deploy on Tuesday and
Thursday afternoons":
```yaml
updateWindows:
  - type: Allow
    schedule: "0 14 * * TUE,THU"
    duration: 3h
    timeZone: America/New_York
```
## A one-off freeze
Absolute windows use `from`/`to` instead of a schedule — for a planned event
freeze:
```yaml
updateWindows:
  - type: Deny
    from: 2026-12-24T00:00:00Z
    to: 2026-12-27T00:00:00Z
```
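The Allow/Deny rule described above ("rollouts happen only inside an active Allow with no active Deny") can be sketched for absolute windows as follows. Cron schedules are left out to keep the example self-contained. The names `Window` and `rollout_allowed` are hypothetical, and treating the window end as exclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Window:
    kind: str        # "Allow" or "Deny"
    start: datetime  # corresponds to `from`
    end: datetime    # corresponds to `to` (treated as exclusive here)

    def active(self, now: datetime) -> bool:
        return self.start <= now < self.end


def rollout_allowed(now: datetime, windows: List[Window]) -> bool:
    """Deny always wins; if any Allow exists, one of them must be active."""
    if any(w.kind == "Deny" and w.active(now) for w in windows):
        return False
    allows = [w for w in windows if w.kind == "Allow"]
    if allows:
        return any(w.active(now) for w in allows)
    return True  # no Allow windows and no active Deny


utc = timezone.utc
freeze = Window("Deny", datetime(2026, 12, 24, tzinfo=utc), datetime(2026, 12, 27, tzinfo=utc))
print(rollout_allowed(datetime(2026, 12, 25, 12, 0, tzinfo=utc), [freeze]))
```

During the freeze the function returns `False`; once the window has passed, a new revision would be free to roll out.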
## What a closed window blocks
`windowScope` controls what a closed window holds back:
- **`Updates`** (default) — hold only the rollout of a *new* artifact revision.
Drift correction keeps re-applying the pinned state, so the live cluster stays
on its last-approved revision but doesn't fall out of sync.
- **`All`** — a hard freeze: also pause drift correction, so the controller
applies nothing at all while the window is closed.
```yaml
windowScope: Updates # default: hold new revisions, keep correcting drift
# windowScope: All # hard freeze: also pause drift correction
```
## Shipping anyway
To push a held rollout through immediately, override the window with
[`stagesetctl`](/cli/):
```shell
stagesetctl reconcile my-app --update-now
```
This stamps the `stages.metio.wtf/update-now` annotation; the honored value is
recorded in `status.lastHandledUpdateOverride`.
---
# Versioned migrations
Source: https://stageset.projects.metio.wtf/usage/versioned-migrations/
Some changes only need to happen once, when you cross a release boundary — a
one-time data backfill on the way to 2.0, a schema conversion between 1.x and 2.x.
Versioned migrations run a ladder of [actions](/usage/actions/) exactly when the
deployed version steps over the boundary, and never again.
Versioning is off until you set `spec.version`.
## Declaring the version
The controller needs to know *what version is currently being deployed*. There are
three ways to declare it; pick by **where the version lives**.
| Source | The version lives… | Best for |
|---|---|---|
| [`version.value`](#inline--versionvalue) | on the `StageSet` | environment-pinned versions, quick starts |
| [`version.fromObject`](#from-a-rendered-object--versionfromobject) | inside the manifests | **any source, including JaaS** — the recommended default |
| [`version.fromArtifact`](#from-a-file-in-the-artifact--versionfromartifact) | a file in the artifact | Git/OCI/Bucket sources that can ship a `VERSION` file |
Whichever you choose, the resolved value is trimmed and parsed as semver (a leading
`v` is accepted). A missing stage/object/file, an empty value, or an unparseable
one fails terminally with the `InvalidVersion` reason (see its
[runbook](/runbooks/invalidversion/)) — a half-versioned system is worse than an
unversioned one.
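The normalization described above (trim, accept a leading `v`, parse as semver, fail on anything else) might look like the sketch below. It is not the controller's parser: `parse_version` and the `InvalidVersion` exception class are illustrative names borrowed from the reason string, and the regular expression is a simplified semver pattern.

```python
import re
from typing import Tuple

# Simplified semver: MAJOR.MINOR.PATCH with optional pre-release and build parts.
_SEMVER = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)


class InvalidVersion(ValueError):
    """Raised for an empty or unparseable version value."""


def parse_version(raw: str) -> Tuple[int, int, int]:
    """Trim whitespace, drop one leading 'v', and return (major, minor, patch)."""
    text = raw.strip()
    if text.startswith("v"):
        text = text[1:]
    match = _SEMVER.match(text)
    if match is None:
        raise InvalidVersion(f"not a semantic version: {raw!r}")
    # Pre-release and build metadata are validated but not returned in this sketch.
    return int(match.group(1)), int(match.group(2)), int(match.group(3))


print(parse_version(" v2.1.0\n"))
```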
### Inline — `version.value`
The `StageSet` author pins the version directly. Use this when the version is a
property of the environment rather than of the content, or to get started quickly:
```yaml
spec:
  version:
    value: "2.1.0"   # bump this when you cut a release
```
The trade-off: the version is declared here, not carried by the content, so you
bump it by editing the `StageSet`.
### From a rendered object — `version.fromObject`
The recommended way to let the version travel with the content.
[Kubernetes](https://kubernetes.io/docs/) has a standard place for an
application's version: the `app.kubernetes.io/version` label. Well-formed manifests
set it, so the version is already inside the manifests — `fromObject` reads it back.
This works for every source kind, including a single-document renderer like
[JaaS](https://jaas.projects.metio.wtf/) that has no room for a separate file.
```yaml
spec:
  version:
    fromObject:
      stage: app          # which stage's rendered manifests carry it
      kind: Deployment    # the object to read
      name: web
      # fieldPath omitted → reads metadata.labels['app.kubernetes.io/version']
  stages:
    - name: app
      sourceRef:
        name: my-app
```
The controller builds the `app` stage's manifests (the same render it applies),
finds the `Deployment/web` object, and reads its `app.kubernetes.io/version` label.
Because the label is part of the manifests, the version changes in lockstep with
the content — no second file to keep in sync.
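The default lookup (find the object by kind and name, then read its version label) can be sketched over a list of already-parsed manifests. The function name `version_from_object` is hypothetical, and returning `None` for a missing object or label is a simplification; per the section above, the controller fails with `InvalidVersion` in that case.

```python
from typing import Any, Dict, List, Optional

DEFAULT_LABEL = "app.kubernetes.io/version"


def version_from_object(
    manifests: List[Dict[str, Any]],
    kind: str,
    name: str,
    api_version: Optional[str] = None,
) -> Optional[str]:
    """Return the version label of the first object matching kind and name."""
    for obj in manifests:
        metadata = obj.get("metadata", {})
        if obj.get("kind") != kind or metadata.get("name") != name:
            continue
        if api_version is not None and obj.get("apiVersion") != api_version:
            continue  # apiVersion only narrows the match when it is given
        return metadata.get("labels", {}).get(DEFAULT_LABEL)
    return None


manifests = [
    {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "web", "labels": {DEFAULT_LABEL: "2.1.0"}},
    }
]
print(version_from_object(manifests, "Deployment", "web"))
```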
**Reading a different field.** Set `fieldPath` to a kubectl-style JSONPath that
resolves to the bare version string. (It must be the version *only*; a JSONPath
can't split an `image: web:2.1.0` value, so prefer the label.) `apiVersion` is
optional and narrows the match when a `Kind`+`Name` pair would be ambiguous:
```yaml
spec:
  version:
    fromObject:
      stage: app
      apiVersion: v1
      kind: ConfigMap
      name: app-meta
      fieldPath: "{.data.version}"   # must resolve to a bare semver, e.g. 2.1.0
```
This is the path the [Jsonnet-to-rollout tutorial](/tutorials/jsonnet-to-rollout/)
uses: the snippet renders the version into the manifest's version label, and the
StageSet reads it straight back.
### From a file in the artifact — `version.fromArtifact`
The version travels with the content as a **dedicated file** containing a single
semver. This fits **Git/OCI/Bucket** sources, where you can ship an extra file
beside the manifests. (It does *not* fit JaaS `rendered` output, which is a single
`rendered.json`; use `fromObject` there.)
**Who writes it, and where:** the artifact's producer. For a Git source, commit a
`VERSION` file in the repo; for an OCI/Bucket artifact, include it in the pushed
tree. The file lives at `path` inside the named stage's artifact, relative to the
artifact root:
```text
# VERSION — committed alongside the manifests it versions
2.1.0
```
```yaml
spec:
  version:
    fromArtifact:
      stage: app       # which stage's artifact carries the file
      path: VERSION    # the file's path inside that artifact (cleaned; no leading ./)
  stages:
    - name: app
      sourceRef:
        kind: GitRepository
        name: my-app
```
The controller fetches the `app` stage's artifact and reads the file at `path`.
## Declaring migrations
Each migration names the boundary it crosses (`to`, optionally `from`), the stage
it anchors before, and the actions to run:
```yaml
spec:
  version:
    fromArtifact:
      stage: app
      path: VERSION
  migrations:
    - name: backfill-ledger-2-0
      from: "1.*"                   # optional: only when coming from a 1.x
      to: "2.0.0"                   # the boundary this migration crosses
      stage: app                    # runs before this stage's pre-actions
      actions:
        - name: backfill
          job:
            sourceRef:
              name: ledger-backfill-job
  stages:
    - name: app
      sourceRef:
        name: my-app
```
When the deployed version crosses from a `1.x` into `2.0.0`, the `backfill` job
runs once, anchored before the `app` stage. The controller tracks progress so a
retry doesn't re-run a completed migration:
- `status.version` — the deployed version, written only after a fully successful
run.
- `status.pendingMigrations` — migrations the next run will execute.
- `status.executedMigrations` — the in-flight ledger for the current transition.
Migrations emit `MigrationStarted` / `MigrationCompleted` events (and
`MigrationFailed` on error). A downgrade that would skip a required migration is
refused with the `DowngradeRequiresMigration` reason — see its
[runbook](/runbooks/downgraderequiresmigration/).
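One way to picture "crossing a boundary" is the check below, which decides whether a migration is due for a given upgrade. It is a sketch under stated assumptions: versions are plain `(major, minor, patch)` tuples, the `from` pattern is matched as a shell-style glob against the deployed version, and the boundary is crossed when the deployed version is below `to` and the new version is at or above it. The controller's exact matching rules are not specified in this section, and `migration_due` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def format_version(version: Version) -> str:
    return ".".join(str(part) for part in version)


def migration_due(
    deployed: Version,
    target: Version,
    to: Version,
    from_pattern: Optional[str] = None,
) -> bool:
    """True when moving deployed -> target steps over the `to` boundary."""
    crosses = deployed < to <= target
    matches_from = from_pattern is None or fnmatchcase(
        format_version(deployed), from_pattern
    )
    return crosses and matches_from


# Upgrading 1.4.2 -> 2.1.0 crosses the 2.0.0 boundary from a 1.x release.
print(migration_due((1, 4, 2), (2, 1, 0), to=(2, 0, 0), from_pattern="1.*"))
```

Once the deployed version is already at or past `to`, the check is false, which matches the "and never again" behavior described at the top of this page.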
---
# CLI
Source: https://stageset.projects.metio.wtf/cli/
`stagesetctl` previews, renders, and drives StageSets without waiting for the next
reconcile. It speaks to the cluster with your own kubeconfig — nothing about it runs
in-cluster.
Installed on your `PATH` as `kubectl-stageset`, it also works as a kubectl plugin:
`kubectl stageset <command>` is equivalent to `stagesetctl <command>`.
| Command | Purpose |
|---|---|
| [`get`](/cli/get/) | Print a StageSet's status, or list StageSets. |
| [`build`](/cli/build/) | Render a StageSet's manifests to stdout. |
| [`diff`](/cli/diff/) | Preview what a reconcile would change; usable as a CI gate. |
| [`reconcile`](/cli/reconcile/) | Force an out-of-band reconcile. |
## Global flags
Every command accepts the standard kubectl connection flags
(`genericclioptions.ConfigFlags`): `--kubeconfig`, `--context`, `-n/--namespace`,
`--as`, `--as-group`, `--server`, `--token`, `--request-timeout`, and the rest.
`--version` prints the binary version and commit (`<version> (commit <sha>)`).
With no `-n/--namespace`, the command uses the namespace from your current
kubeconfig context, falling back to `default`.
## Exit codes
Every command shares the same baseline:
| Code | Meaning |
|---|---|
| `0` | Success. |
| `2` | Usage or flag error. |
| `3` | Runtime error. |
[`diff`](/cli/diff/) adds one more: it exits `1` when it finds changes (the
`diff(1)` convention), so it can gate a CI pipeline.
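A CI wrapper might translate these exit codes into a pass or fail decision. The sketch below encodes only the table above plus the `diff` exception; `describe_exit` and `gate_passes` are hypothetical helper names, and the wrapper that actually invokes `stagesetctl` is left out.

```python
def describe_exit(command: str, code: int) -> str:
    """Map a stagesetctl exit code to its documented meaning."""
    if code == 0:
        return "success"
    if code == 1 and command == "diff":
        return "changes found"
    if code == 2:
        return "usage or flag error"
    if code == 3:
        return "runtime error"
    return "unexpected exit code"


def gate_passes(command: str, code: int) -> bool:
    """A strict gate: only a zero exit code lets the pipeline continue."""
    return code == 0


for code in (0, 1, 2, 3):
    print(code, describe_exit("diff", code), gate_passes("diff", code))
```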
---
# stagesetctl build
Source: https://stageset.projects.metio.wtf/cli/build/
Runs the same resolve → fetch → build pipeline the controller uses and writes the
result — a multi-document YAML stream — to stdout. This is what would be applied,
before it is applied. To preview the change against live cluster state instead, use
[`diff`](/cli/diff/).
```text
stagesetctl build NAME [flags]
```
| Flag | Default | Description |
|---|---|---|
| `--stage` | _(all)_ | Render only the named stage(s); repeatable. |
| `--source-dir` | _(none)_ | Use a local artifact tree as `[STAGE=]PATH` instead of fetching from the cluster; repeatable. |
| `--show-secrets` | `false` | Reveal Secret values instead of masking them. |
| `--as-tenant` | `false` | Render impersonating the StageSet's `spec.serviceAccountName` (see [multi-cluster and tenancy](/usage/multi-cluster/)). |
Secret values are masked by default, so the output is safe to paste into a review.
`build` writes YAML unconditionally — there is no output-format flag.
## Example
```shell
stagesetctl build payments --stage application
```
```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: payments
spec:
  replicas: 6
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: web
          image: registry.internal/web:2.1.0
---
apiVersion: v1
kind: Secret
metadata:
  name: web-config
  namespace: payments
type: Opaque
data:
  token: '***'   # masked; pass --show-secrets to reveal
```
`--source-dir` makes `build` work offline — point it at the directory an artifact
would have unpacked to and it skips the cluster fetch, for authoring and CI. The
value is `[STAGE=]PATH`: prefix a stage name to target one stage, or give a bare
path to feed every stage that has no entry of its own. Repeat the flag to map
each stage to its own tree:
```shell
# one stage from a local tree
stagesetctl build payments --stage application --source-dir application=./out
# every stage from one tree (bare path), overriding just infrastructure
stagesetctl build payments \
--source-dir ./checkout \
--source-dir infrastructure=./infra-checkout
```
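The `[STAGE=]PATH` rule (a stage-specific entry wins, a bare path covers every other stage) can be modeled as below. This is an illustration of the documented behavior, not the CLI's parser: `parse_source_dirs` and `dir_for` are hypothetical names, and splitting on the first `=` is an assumption about how a prefixed value is recognized.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_source_dirs(values: Iterable[str]) -> Tuple[Dict[str, str], Optional[str]]:
    """Split repeated --source-dir values into per-stage paths and a fallback."""
    per_stage: Dict[str, str] = {}
    fallback: Optional[str] = None
    for value in values:
        stage, separator, path = value.partition("=")
        if separator:
            per_stage[stage] = path  # STAGE=PATH targets one stage
        else:
            fallback = value         # a bare PATH feeds every other stage
    return per_stage, fallback


def dir_for(stage: str, per_stage: Dict[str, str], fallback: Optional[str]) -> Optional[str]:
    """Return the local tree for a stage, or None to fetch from the cluster."""
    return per_stage.get(stage, fallback)


per_stage, fallback = parse_source_dirs(["./checkout", "infrastructure=./infra-checkout"])
print(dir_for("infrastructure", per_stage, fallback))
print(dir_for("application", per_stage, fallback))
```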
---
# stagesetctl diff
Source: https://stageset.projects.metio.wtf/cli/diff/
By default `diff` performs a
[server-side](https://kubernetes.io/docs/reference/using-api/server-side-apply/)
dry-run apply and exits `1` when there are changes, so it works as a CI gate. It
shows, per object, what a reconcile would create, configure, or delete, plus the
[actions](/usage/actions/) a rollout would run. To see the full rendered manifests
without comparing against the cluster, use [`build`](/cli/build/).
```text
stagesetctl diff NAME [flags]
```
| Flag | Default | Description |
|---|---|---|
| `--stage` | _(all)_ | Diff only the named stage(s); repeatable. |
| `--source-dir` | _(none)_ | Use a local artifact tree as `[STAGE=]PATH`; repeatable. Skips the cluster fetch. |
| `--server-side` | `true` | Server-side dry-run apply diff (needs update/patch RBAC). `false` renders client-side against live objects. |
| `--as-tenant` | `false` | Render and dry-run impersonating `spec.serviceAccountName` (see [multi-cluster and tenancy](/usage/multi-cluster/)). |
| `--show-secrets` | `false` | Reveal Secret values instead of masking. |
| `--show-unchanged` | `false` | Include objects with no change. |
| `--prune` | `true` | Show resources that would be deleted (fell out of inventory). |
| `--color` | `auto` | Colorize output: `auto`, `always`, or `never`. |
| `--exit-code` | `true` | Exit `1` when changes are found. `false` exits `0` even when changes are found, as long as the run itself succeeds. |
## Example
```shell
stagesetctl diff payments
```
```text
--- live
+++ merged
@@ Deployment payments/web @@
spec:
- replicas: 3
+ replicas: 6
- ConfigMap payments/old-feature-flags (pruned: fell out of inventory)
Actions to run:
application:
pre db-migrate job ledger-migrations
post smoke-test http https://payments.internal/healthz
```
Objects that left the stage's [inventory](/api/stageinventory/) show as deletions
(`pruned: …`); pass `--prune=false` to hide them. The trailing `Actions to run`
block lists the [pre/post/onFailure actions](/usage/actions/) a real reconcile
would execute — `diff` never runs them, it only reports them.
A clean run prints nothing and exits `0`; pending changes exit `1`. To inspect
without failing the shell:
```shell
stagesetctl diff payments --color=never --exit-code=false
```
Use `--server-side=false` when you lack apply RBAC and only need a textual
render-versus-live comparison.
---
# stagesetctl get
Source: https://stageset.projects.metio.wtf/cli/get/
With no `NAME`, lists StageSets in the current namespace. With a `NAME`, prints that
StageSet's detail (Ready reason, per-stage phase, revisions, version) — a readable
view of [`StageSet.status`](/api/stageset/#status).
```text
stagesetctl get [NAME] [flags]
```
| Flag | Default | Description |
|---|---|---|
| `-A`, `--all-namespaces` | `false` | List StageSets across all namespaces. |
| `-o`, `--output` | _(table)_ | Output format: empty for the human table, or `yaml` / `json`. |
## Listing
```shell
stagesetctl get -A
```
```text
NAMESPACE   NAME       READY   REASON        STAGES   VERSION   PENDING
payments    payments   True    Succeeded     2/2      2.1.0     -
platform    platform   True    Succeeded     3/3      -         -
staging     web        False   StageFailed   1/2      -         -
```
`STAGES` is `ready/total`; `PENDING` shows `held until