Runbooks
One page per status.conditions[Ready].reason the controller sets, plus a few
operational alert runbooks. Each page covers the symptom, the cause, how to
diagnose it, and how to remediate.
Point the controller at a published copy of these pages with --runbook-base-url
(for example https://stageset.projects.metio.wtf/runbooks); the controller then
appends a link to the matching page to each actionable Ready message. Healthy
reasons (Succeeded, Suspended) get no link.
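As a rough illustration of the linking behaviour described above, the sketch below builds a per-reason link from a base URL. It is not the controller's code: the `<base>/<reason>` URL layout, the function names `runbook_link` and `ready_message`, and the `(runbook: ...)` message wording are all assumptions made for this example. Only the base URL flag, the reason names, and the rule that healthy reasons get no link come from this page.

```python
from typing import Optional

# Reasons this page describes as healthy; they get no runbook link.
HEALTHY_REASONS = {"Succeeded", "Suspended"}


def runbook_link(base_url: str, reason: str) -> Optional[str]:
    """Return a hypothetical runbook URL for a Ready reason, or None."""
    if not base_url or reason in HEALTHY_REASONS:
        return None
    # Assumption: one page per reason, addressed as <base>/<reason>.
    return base_url.rstrip("/") + "/" + reason


def ready_message(message: str, base_url: str, reason: str) -> str:
    """Append the runbook link to a Ready message when one applies."""
    link = runbook_link(base_url, reason)
    if link is None:
        return message
    # Assumption: the controller's actual message wording may differ.
    return f"{message} (runbook: {link})"
```

Under these assumptions, a StageFailed condition would carry a link ending in `/StageFailed`, while a Succeeded condition, or a controller started without a base URL, would leave the message unchanged.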
- ArtifactNotFound
The referenced ExternalArtifact could not be found (transient; the controller requeues).
- Controller pod down
A stageset-controller pod has been NotReady for the alert window.
- DependencyNotReady
A StageSet named in spec.dependsOn is not yet Ready.
- DowngradeRequiresMigration
The desired version is below the deployed version and a migration boundary blocks the downgrade.
- InvalidSpec
The StageSet spec is invalid; the Message names the offending field or action.
- InvalidVersion
A version source or value could not be parsed as semver.
- PreviousRevisionUnavailable
rollbackOnFailure is set but the last-good revisions could not be restored.
- Reconcile latency high
Reconcile p99 latency for the StageSet controller is above threshold.
- ResolveFailed
A source reference could not be resolved to a ready ExternalArtifact.
- SourceNotReady
The source exists but has not published a ready artifact yet (transient).
- StageFailed
A stage failed to fetch, build, apply, verify, or run an action; the run halts there.
- Stalled
The run cannot make progress and will not retry until the spec changes.
- Succeeded
All stages applied and verified; this is the healthy steady state.
- Suspended
Reconciliation is paused via spec.suspend.
- UpdateDeferred
A new revision is held by a closed update window.
- Webhook cert renewal failing
The self-signed admission webhook certificate is not being rotated.
- Workqueue saturation
The controller cannot drain its reconcile queue fast enough.