Production see history edit this page

Talks about: , , and

High availability

The controller supports leader-elected HA. Enable leader election and run more than one replica; only the lease holder reconciles, while every replica answers admission webhook calls (admission must stay available even on non-leaders).

The controller watches every namespace by default. Multi-tenancy is enforced per StageSet through impersonation (see below). You can additionally scope the controller to a namespace set with controller.watchNamespaces — one controller instance per tenant-group — and run it under cluster-admin for single-tenant clusters; both are covered in multi-cluster and tenancy.

Hardening

Each option below is shown as the Helm values that configure it. Several are already the chart’s defaults, shown so you can see what is applied and override it for a stricter policy.

Tenant impersonation

The controller never applies your manifests with its own identity. Every cluster operation for a StageSet — building, applying, pruning, running actions — is performed impersonating the StageSet’s spec.serviceAccountName (the chart grants the controller impersonate, not write access). A StageSet can only do what its tenant SA permits; an over-broad or missing SA fails closed.

This one lives on the StageSet, not in the chart — give every production StageSet a scoped ServiceAccount:

apiVersion: stages.metio.wtf/v1
kind: StageSet
metadata: { name: payments, namespace: payments }
spec:
  serviceAccountName: payments-deployer   # scoped to exactly this release's needs
  # …

Pod security context

The chart runs a non-root, read-only-root-filesystem pod with all capabilities dropped, on a gcr.io/distroless/static:nonroot image (no shell or package manager). These are the rendered defaults:

podSecurityContext:
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
securityContext:
  runAsNonRoot: true
  runAsUser: 65532
  runAsGroup: 65532
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: [ALL]
  seccompProfile:
    type: RuntimeDefault

Resource limits

Requests equal limits, so the pod is fully constrained:

resources:
  cpu: 50m
  memory: 256Mi
  ephemeralStorage: 32Mi   # /tmp and the self-signed cert dir are emptyDirs

Pod-Security Standards namespace

Have the chart create the install namespace with restricted PSS labels:

namespace:
  create: true
  pssLevel: restricted     # or: baseline / privileged

Network policy

The gate endpoint is unauthenticated (read-only GET /gate/{namespace}/{stageset}/{stage}). Turn on the ingress-only NetworkPolicy to fence it — and the webhook/metrics ports — to only the peers that need them:

networkPolicy:
  enabled: true            # admits the webhook (9443), metrics (8080), gate (8082)

The policy is ingress-only, so it does not restrict egress — the controller can still fetch stage artifacts over HTTP from source-controller (an ExternalArtifact or a GitRepository/OCIRepository/Bucket is served from the same artifact endpoint). If your cluster default-denies egress, add an egress allowance to source-controller (and DNS) so those fetches succeed.

Admission webhook TLS

webhook.certMode chooses how the webhook serving certificate is obtained:

webhook:
  certMode: cert-manager   # cert-manager issues + rotates the cert (requires cert-manager)
  # certMode: self-signed  # chart default: in-pod CA + serving cert, rotated at
  #                          validity/3, with no cert-manager dependency

Reference setups

Two HA shapes — on-prem with shared RWX storage, and AWS/EKS with S3 — over the same backbone: a leader-elected pair (or trio), a rollback store reachable from whichever pod holds the lease, cert-manager for the webhook, a NetworkPolicy fencing the unauthenticated gate, and a ServiceMonitor if you run Prometheus.

Both run two replicas for HA (replicas.max above replicas.min also renders a PDB and an HPA) and set webhook.certMode: cert-manager, so cert-manager must be installed in the cluster.

On-prem (RWX storage)

The rollback store gives bit-exact rollbacks that outlive producer GC. With HA replicas it must be reachable from whichever pod holds the lease, so use a ReadWriteMany PVC on your on-prem storage class — every replica mounts the same volume.

# values-onprem.yaml
replicas:
  min: 2                 # leader-elected HA; the non-leader still serves admission
  max: 3                 # > min renders an HPA (CPU 80%) and a PodDisruptionBudget

controller:
  leaderElect: true

rollbackStore:
  backend: pvc
  pvc:
    accessModes: [ReadWriteMany]
    storageClass: nfs-client     # your RWX class (NFS, CephFS, …)
    size: 10Gi

webhook:
  certMode: cert-manager         # requires cert-manager in the cluster

networkPolicy:
  enabled: true                  # fences the unauthenticated gate endpoint

metrics:
  serviceMonitor:
    enabled: true
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace \
  -f values-onprem.yaml

AWS / EKS (S3)

On EKS, back the rollback store with S3 and let the controller assume an IAM role through IRSA — no static keys. Annotate the controller’s ServiceAccount with the role ARN and leave the S3 credentials empty; the store’s minio-go client picks the role up from the pod’s web-identity token.

# values-eks.yaml
replicas:
  min: 2
  max: 3

controller:
  leaderElect: true

serviceAccount:
  annotations:
    # an IAM role granting s3:GetObject/PutObject/ListBucket/DeleteObject on the bucket
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/stageset-controller

rollbackStore:
  backend: s3
  s3:
    endpoint: s3.eu-west-1.amazonaws.com
    bucket: my-org-stageset-rollback
    region: eu-west-1
    # no existingSecret → credentials come from the IRSA role above

webhook:
  certMode: cert-manager

networkPolicy:
  enabled: true

metrics:
  serviceMonitor:
    enabled: true
helm upgrade --install stageset-controller \
  oci://ghcr.io/metio/helm-charts/stageset-controller \
  --namespace stageset-system --create-namespace \
  -f values-eks.yaml

Alongside the other Flux controllers

stageset-controller is a Flux citizen and needs no special wiring to coexist with source-controller, kustomize-controller, helm-controller, and notification-controller. It reads ExternalArtifact (and the standard GitRepository, OCIRepository, and Bucket sources) from source-controller, and notification-controller routes its events through an Alert that targets kind: StageSet — no Provider/Alert plumbing of its own. Install it in its own namespace (e.g. stageset-system) next to flux-system; the only cluster-scoped pieces are its CRDs, ClusterRole, and webhook configuration.

Alongside JaaS

JaaS renders Jsonnet and publishes the result as an ExternalArtifact, which is what a StageSet stage consumes — so the two compose directly. Reference the artifact by name, or name the producing JsonnetSnippet and let stageset-controller resolve it (see producer-aware sources). They can share a cluster and namespace or stay separate; both are reconciled by Flux and both apply under per-tenant impersonation, so the security model is consistent end to end.

Settings you can tune

The chart wires the controller; you set Helm values. The set worth thinking about is below — each row is the value, its default, and when you’d change it. Everything else the chart configures for you (see what the chart manages).

Helm valueDefaultWhen to change
replicas.min / replicas.max1 / 1Raise both to ≥ 2 for HA; set max > min to also render an HPA + PDB.
controller.leaderElecttrueLeave on — harmless at one replica, required for HA.
controller.defaultInterval10mThe reconcile cadence StageSets inherit when they omit spec.interval. Lower for faster drift correction cluster-wide.
controller.inventoryModehybridapplyset for ApplySet-native tooling; entries to drop the ApplySet labels.
controller.inventoryShardCap5000Lower only if a stage applies a huge object count and you want smaller inventory objects.
controller.allowedActionHosts[]Add host globs your http actions must reach (loopback/link-local are always denied).
controller.noCrossNamespaceRefsfalsetrue to hard-isolate namespaces (deny cross-namespace sourceRef/dependsOn).
controller.watchNamespaces[]Restrict the controller to a namespace list (cache + RBAC pivot to per-namespace bindings); empty watches cluster-wide. See tenancy.
rbac.clusterAdminfalsetrue on single-tenant clusters to bind the controller SA to cluster-admin so StageSets apply without serviceAccountName. See single-tenant.
controller.runbookBaseURLthe docs sitePoint at a fork/mirror, or empty to drop the runbook links from Ready messages.
webhook.certModeself-signedcert-manager if you run cert-manager — see reference setups.
gate.enabledtrueLeave on for progressive delivery (the Flagger/Argo gate); set false to drop the gate Service and endpoint.
rollbackStore.backendnonepvc (RWX) or s3 to enable spec.rollbackOnFailure; the two are mutually exclusive.
rollbackStore.s3.sses3At-rest encryption for the S3 store (it holds rendered Secret data): s3 (SSE-S3), kms (+sseKmsKeyId), or none. See encryption at rest.
networkPolicy.enabledfalsetrue to fence the controller and the unauthenticated gate.
metrics.serviceMonitor.enabledfalsetrue if you scrape with the Prometheus operator.
metrics.prometheusRule.enabledfalsetrue for the bundled alerts.
serviceAccount.annotations{}An IRSA role ARN on EKS so the S3 store uses an IAM role.
namespace.createfalsetrue to have the chart create the install namespace with Pod-Security labels.
resourcesrequests = limitsRaise for very large or very busy releases.

Every option is set the same way — in your values file, applied with helm upgrade --install … -f values.yaml. The reference setups above are complete, copy-pasteable examples.

What the chart manages

You do not configure these — the chart wires them so the controller behaves correctly out of the box:

Controller flags reference

You never pass these directly — the chart sets them from your Helm values and its own defaults — but it helps to know what each does and which value drives it.

FlagDefaultPurposeDriven by
--metrics-bind-address:8080Prometheus metrics endpoint.chart-managed
--health-probe-bind-address:8081/healthz + /readyz.chart-managed
--leader-electfalseEnable leader election.controller.leaderElect
--inventory-modehybridInventory strategy: entries, hybrid, applyset.controller.inventoryMode
--inventory-shard-cap5000Max entries per StageInventory shard.controller.inventoryShardCap
--allowed-action-hosts(none)Host glob permitted for http actions (loopback/link-local always denied).controller.allowedActionHosts
--no-cross-namespace-refsfalseDeny cross-namespace sourceRef / dependsOn.controller.noCrossNamespaceRefs
--watch-namespaces(all)Comma-separated namespaces to watch; empty is cluster-wide. Falls back to STAGESET_WATCH_NAMESPACES.controller.watchNamespaces
--enable-webhooktrueRun the validating admission webhook.chart-managed
--webhook-cert-modecert-managercert-manager or self-signed.webhook.certMode
--webhook-cert-dir/tmp/k8s-webhook-server/serving-certsWhere tls.crt/tls.key live.chart-managed
--webhook-port9443Webhook server port.chart-managed
--webhook-cert-validity8760hSelf-signed cert lifetime (rotates at ⅓).webhook.*
--webhook-service-namestageset-controller-webhookService used to build cert SANs.chart-managed
--webhook-service-namespace(in-cluster ns)Namespace of the webhook Service.chart-managed
--webhook-validating-config-name(none)The ValidatingWebhookConfiguration to patch.chart-managed
--gate-bind-address:8082Read-only gate endpoint; empty disables.gate.enabled
--runbook-base-url(none)URL prefix on actionable Ready messages.controller.runbookBaseURL
--rollback-store-path(none)Filesystem (RWX PVC) rollback store.rollbackStore.backend: pvc
--rollback-store-s3-*(off)S3-compatible rollback store.rollbackStore.backend: s3
--rollback-store-s3-sses3At-rest encryption for the S3 store: none, s3, kms.rollbackStore.s3.sse

The PVC and S3 rollback stores are mutually exclusive — the chart enforces it via rollbackStore.backend. The controller also accepts controller-runtime’s --zap-* logging flags.