SLO Required For Prod
An SLO is the contract between the service team and its users about how reliable the service must be. The SLO has three required components: (1) an SLI — the precise mathematical definition of 'good' (e.g.…
$ prime install @community/rule-slo-required-for-prod
Projection
Always in _index.xml · the agent never has to ask for this.
SloRequiredForProd [rule] v1.0.0
Every user-facing service in production must publish at least one Service Level Objective (SLO) — a numeric target (99.9% / 99.95% / 99.99%) for a specific Service Level Indicator (SLI) measured over a rolling window (typically 28 or 30 days). No SLO = no production launch.
An SLO is the contract between the service team and its users about how reliable the service must be. The SLO has three required components: (1) an SLI — the precise mathematical definition of 'good' (e.g. 'fraction of HTTP requests with status < 500 AND latency < 500ms p95'), (2) a target — the percentage that must be 'good' (99.9% / 99.95% / 99.99%), and (3) a window — the rolling time period (28 or 30 days). The SLO defines the 'error budget' = (1 − target) × window; consuming the budget too quickly triggers freeze policies; consuming it slowly buys headroom for risk-taking.
Loaded when retrieval picks the atom as adjacent / supporting.
SloRequiredForProd [rule] v1.0.0
Every user-facing service in production must publish at least one Service Level Objective (SLO) — a numeric target (99.9% / 99.95% / 99.99%) for a specific Service Level Indicator (SLI) measured over a rolling window (typically 28 or 30 days). No SLO = no production launch.
An SLO is the contract between the service team and its users about how reliable the service must be. The SLO has three required components: (1) an SLI — the precise mathematical definition of 'good' (e.g. 'fraction of HTTP requests with status < 500 AND latency < 500ms p95'), (2) a target — the percentage that must be 'good' (99.9% / 99.95% / 99.99%), and (3) a window — the rolling time period (28 or 30 days). The SLO defines the 'error budget' = (1 − target) × window; consuming the budget too quickly triggers freeze policies; consuming it slowly buys headroom for risk-taking.
Applies To
- Every production HTTP/gRPC service with external or internal customers
- Batch pipelines (define SLO on freshness or completion-time)
- Asynchronous workers / queues (SLO on message-age or processing-rate)
- Mobile/desktop clients (SLO on crash-free sessions)
- Internal platforms (the platform team has SLOs FOR the product teams who consume it)
Implementation Checklist
- SLI defined as a precise query against the metrics backend (PromQL, MQL, Datadog query) — not prose
- SLO target chosen from {99.9%, 99.95%, 99.99%} — extra nines = 10× harder; pick the lowest target the user need supports
- Rolling 28- or 30-day window (matches business cycle; avoids end-of-month resets)
- Error-budget burn-rate alerts at 1h-fast (14.4× burn) and 6h-slow (6× burn) — see Google SRE Workbook Ch.5
- SLO published in a centralised registry (Sloth, Pyrra, Datadog SLO Manager, OpenSLO YAML)
- Quarterly SLO review: was the target met? Was it too easy? Was it too hard? Adjust based on user signals
- Error-budget policy committed (rule-error-budget-policy.prime) — what happens when budget is exhausted
Severity
block
Counter Examples
- Service deployed with Datadog dashboards but no SLO. On-call alerts fire on 'error rate > 1%' — but is 1% bad? The team has no answer; they ack and move on. Reliability degrades imperceptibly.
- SLO defined as '100% uptime' — impossible target; ignored within weeks; reverts to no SLO.
- SLO defined as 'fast response times' (no number) — non-actionable; cannot drive decisions.
- Vague window: 'we aim for 99.9% in a typical month' — no formal start/end; impossible to measure adherence.
Loaded when retrieval picks the atom as a focal / direct hit.
SloRequiredForProd [rule] v1.0.0
Every user-facing service in production must publish at least one Service Level Objective (SLO) — a numeric target (99.9% / 99.95% / 99.99%) for a specific Service Level Indicator (SLI) measured over a rolling window (typically 28 or 30 days). No SLO = no production launch.
An SLO is the contract between the service team and its users about how reliable the service must be. The SLO has three required components: (1) an SLI — the precise mathematical definition of 'good' (e.g. 'fraction of HTTP requests with status < 500 AND latency < 500ms p95'), (2) a target — the percentage that must be 'good' (99.9% / 99.95% / 99.99%), and (3) a window — the rolling time period (28 or 30 days). The SLO defines the 'error budget' = (1 − target) × window; consuming the budget too quickly triggers freeze policies; consuming it slowly buys headroom for risk-taking.
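Translating the budget formula into concrete numbers, a minimal sketch — the targets and the 30-day window come from this rule; everything else is illustrative:

```python
# Error budget = (1 - target) * window, expressed here as the number of
# fully-bad minutes the service may accumulate over a rolling 30-day window.
WINDOW_DAYS = 30
WINDOW_MINUTES = WINDOW_DAYS * 24 * 60  # 43,200 minutes

for target in (0.999, 0.9995, 0.9999):
    budget_minutes = (1 - target) * WINDOW_MINUTES
    print(f"{target:.2%} over {WINDOW_DAYS}d -> {budget_minutes:.1f} min of budget")

# 99.90% -> 43.2 min, 99.95% -> 21.6 min, 99.99% -> 4.3 min per window.
```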
Applies To
- Every production HTTP/gRPC service with external or internal customers
- Batch pipelines (define SLO on freshness or completion-time)
- Asynchronous workers / queues (SLO on message-age or processing-rate)
- Mobile/desktop clients (SLO on crash-free sessions)
- Internal platforms (the platform team has SLOs FOR the product teams who consume it)
Implementation Checklist
- SLI defined as a precise query against the metrics backend (PromQL, MQL, Datadog query) — not prose (see the sketch after this list)
- SLO target chosen from {99.9%, 99.95%, 99.99%} — extra nines = 10× harder; pick the lowest target the user need supports
- Rolling 28- or 30-day window (matches business cycle; avoids end-of-month resets)
- Error-budget burn-rate alerts at 1h-fast (14.4× burn) and 6h-slow (6× burn) — see Google SRE Workbook Ch.5 and the worked numbers after this list
- SLO published in a centralised registry (Sloth, Pyrra, Datadog SLO Manager, OpenSLO YAML)
- Quarterly SLO review: was the target met? Was it too easy? Was it too hard? Adjust based on user signals
- Error-budget policy committed (rule-error-budget-policy.prime) — what happens when budget is exhausted
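The sketch below ties the SLI-as-query and burn-rate items together for one hypothetical service. The PromQL metric name (http_requests_total) and labels are placeholders for whatever the metrics backend actually exposes, and the numbers assume a 99.9% target over 30 days:

```python
TARGET = 0.999          # chosen from {99.9%, 99.95%, 99.99%}
WINDOW_HOURS = 30 * 24  # rolling 30-day window = 720 hours

# SLI as a precise query, not prose: ratio of good requests to all requests.
# (Simplified to status only; a real SLI might also gate on latency.)
SLI_QUERY = """
  sum(rate(http_requests_total{status!~"5.."}[5m]))
/
  sum(rate(http_requests_total[5m]))
"""

# Multi-window burn-rate alerting (Google SRE Workbook, Ch.5): a burn rate of B
# means the budget is consumed B times faster than the SLO allows, i.e. the
# observed error ratio is B * (1 - target).
def error_ratio_threshold(burn_rate: float, target: float) -> float:
    return burn_rate * (1 - target)

def budget_consumed(burn_rate: float, alert_window_hours: float) -> float:
    """Fraction of the full window's budget burned during the alert window."""
    return burn_rate * alert_window_hours / WINDOW_HOURS

for name, burn, hours in (("fast", 14.4, 1), ("slow", 6.0, 6)):
    print(f"{name}: page when error ratio > "
          f"{error_ratio_threshold(burn, TARGET):.4f} sustained for {hours}h "
          f"(~{budget_consumed(burn, hours):.0%} of the budget gone)")

# fast: error ratio > 0.0144 over 1h  (~2% of the 30-day budget)
# slow: error ratio > 0.0060 over 6h  (~5% of the 30-day budget)
```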
Severity
block
Counter Examples
- Service deployed with Datadog dashboards but no SLO. On-call alerts fire on 'error rate > 1%' — but is 1% bad? The team has no answer; they ack and move on. Reliability degrades imperceptibly.
- SLO defined as '100% uptime' — impossible target; ignored within weeks; reverts to no SLO.
- SLO defined as 'fast response times' (no number) — non-actionable; cannot drive decisions.
- Vague window: 'we aim for 99.9% in a typical month' — no formal start/end; impossible to measure adherence.
Examples
- Google Cloud Spanner SLO: 99.999% availability for multi-region instances (5.26 minutes/year of downtime) — published; refunds offered if missed.
- Sloth (open-source): generate Prometheus recording + alerting rules from a YAML SLO spec — single source of truth.
- OpenSLO (CNCF, Nobl9 et al.): YAML schema for SLO definitions portable across vendors — alpha but stabilizing (sketched after this list).
- GitHub status page exposes SLOs publicly: API requests target 99.9% availability over 28 days; visible to all customers.
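For orientation, a registry entry in the OpenSLO style might look roughly like the following. This is a loose sketch only: the service, metric names, queries, and some field names are assumptions, so check the current OpenSLO schema before relying on it. It is written as a Python dict serialisable to the YAML a registry would store:

```python
import yaml  # PyYAML (third-party), assumed available

# Loose sketch of an OpenSLO-style SLO definition; field names approximate.
slo_spec = {
    "apiVersion": "openslo/v1",
    "kind": "SLO",
    "metadata": {"name": "checkout-availability"},
    "spec": {
        "service": "checkout",  # hypothetical service
        "budgetingMethod": "Occurrences",
        "timeWindow": [{"duration": "28d", "isRolling": True}],
        "indicator": {
            "metadata": {"name": "checkout-good-requests"},
            "spec": {
                "ratioMetric": {
                    "counter": True,
                    # Placeholder PromQL; adapt to the real metrics backend.
                    "good": {"metricSource": {"type": "Prometheus", "spec": {
                        "query": 'sum(rate(http_requests_total{status!~"5.."}[5m]))'}}},
                    "total": {"metricSource": {"type": "Prometheus", "spec": {
                        "query": "sum(rate(http_requests_total[5m]))"}}},
                }
            },
        },
        "objectives": [{"displayName": "availability", "target": 0.999}],
    },
}

print(yaml.safe_dump(slo_spec, sort_keys=False))
```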
Rationale
Without an SLO, 'reliability' is a feeling. The team cannot decide whether to ship a feature, whether to roll back, or whether to invest in resilience work, because there is no shared definition of 'reliable enough'. The SLO converts reliability into a quantity that can be budgeted, traded against velocity, and used to align engineering, product, and SRE on the same target. Google's SRE Book is built around this primitive; AWS Well-Architected, Microsoft Azure Well-Architected, and the CNCF SLO Working Group all converge on the same model.
Derived From
@community/principle-three-pillars-observability
Source
prime-system/examples/frontend-design/primes/compiled/@community/rule-slo-required-for-prod/atom.yaml