pattern @community/pattern-error-budget-policy

Error Budget Policy

A pre-agreed, written policy that describes what happens automatically when a service consumes its error budget faster than the SLO target allows — typically a feature-freeze, a focus on reliability work, or shifting on-…

Skill: @community
Domain: ops-observability
Version: 1.0.0
Quality: 4.0
Edges: 6 out · 5 in
Tokens: 84/1030/2129

$ prime install @community/pattern-error-budget-policy

Projection

3 levels · agent picks one per query

Always in _index.xml · the agent never has to ask for this.

ErrorBudgetPolicy [pattern] v1.0.0

Loaded when retrieval picks the atom as adjacent / supporting.

ErrorBudgetPolicy [pattern] v1.0.0

Facts

Label

Error-Budget Policy

Problem

An SLO without an enforcement mechanism is an aspiration. The team that owns velocity will always find one more feature to ship; the team that owns reliability will always want to invest in toil reduction; without a policy, every conversation reduces to negotiation under stress during an incident. Error budget burn is the canonical signal but it is useless unless tied to a deterministic action.

Solution

Codify the response to budget consumption in a written policy, signed by both engineering and product leadership before the SLO goes into effect. The policy specifies (1) thresholds (e.g. 'burn rate > 2× for >6 hours'), (2) automatic gates ('feature freeze on this service'), (3) escalation ('exec review if budget < 0 for two consecutive 28-day windows'), and (4) reset rules. Burn-rate alerts wired to the policy trigger a deterministic action, not a Slack message that the team negotiates over. The canonical references are the Google SRE Workbook's Chapter 2 (Implementing SLOs) and its example error budget policy in Appendix B; Chapter 5 (Alerting on SLOs) covers burn-rate alerting.
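
A minimal sketch of what "deterministic action" can mean in practice: a pure function that maps the current budget state to exactly one action. The names `BudgetState`, `Action` and `decide` are hypothetical, and the thresholds (50%, 20%, burn rate above 6× over 6 hours, two consecutive violated windows) are assumptions borrowed from the example figures on this page, not a prescribed standard.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NORMAL = "normal operation"
    SIGN_OFF = "risky changes require sign-off"
    FREEZE = "feature freeze"
    FREEZE_AND_ESCALATE = "feature freeze plus exec review"


@dataclass(frozen=True)
class BudgetState:
    # Fraction of the window's error budget still unspent; negative once the SLO is violated.
    budget_remaining: float
    # Burn rate averaged over the last 6 hours.
    burn_rate_6h: float
    # Consecutive windows (including the current one) with budget below zero.
    consecutive_violated_windows: int


def decide(state: BudgetState) -> Action:
    """Return exactly one action, checking the most severe rule first."""
    if state.budget_remaining < 0:
        if state.consecutive_violated_windows >= 2:
            return Action.FREEZE_AND_ESCALATE
        return Action.FREEZE
    if state.budget_remaining < 0.20 or state.burn_rate_6h > 6:
        return Action.FREEZE
    if state.budget_remaining <= 0.50:
        return Action.SIGN_OFF
    return Action.NORMAL


if __name__ == "__main__":
    # Plenty of budget left, but a sustained 7x burn still triggers the freeze.
    print(decide(BudgetState(0.80, 7.0, 0)).value)
```

Because the function has no side effects and no discretionary inputs, the same state always yields the same gate, which is the property the written policy is meant to guarantee.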

Structure

    # Burn rate definitions (Google SRE Workbook Ch. 5, Alerting on SLOs)
    burn_rate = (1 - SLI) / (1 - SLO)
    # If SLO=99.9% and current SLI=99.4%, burn rate = 0.6/0.1 = 6×
    # i.e. a full budget is exhausted in (28d / 6) = ~4.7 days at this rate

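    # Multi-window multi-burn-rate alerts (recommended)
    # Budget percentages are the Workbook's figures for a 30-day window;
    # with the 28-day window used here the same rates spend ~2.1%, ~5.4% and ~10.7%.
    Fast burn:    >14.4× over 1h    AND  >14.4× over 5min   → page on-call (2% budget in 1h)
    Slow burn:    > 6×   over 6h    AND  > 6×   over 30min  → page on-call (5% budget in 6h)
    Slow drip:    > 1×   over 3d    AND  > 1×   over 6h     → ticket (10% budget in 3d)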

    # Policy ladder (example; teams customise)
    ┌───────────────────────────────────────────────────────────────────────┐
    │ Stage 1 — Budget remaining > 50%                                      │
    │   Normal operation. Ship features. Ship infra changes.                │
    ├───────────────────────────────────────────────────────────────────────┤
    │ Stage 2 — Budget remaining 20–50%                                     │
    │   Risky changes (DB migrations, region failovers) require sign-off.   │
    │   Reliability work prioritised in next sprint planning.               │
    ├───────────────────────────────────────────────────────────────────────┤
    │ Stage 3 — Budget remaining < 20% OR burn rate > 6× sustained 6h       │
    │   FREEZE: no new feature deploys. Bugfixes + reliability only.        │
    │   Daily burn-rate review by team lead.                                │
    │   Auto-rollback hooks armed for any new release.                      │
    ├───────────────────────────────────────────────────────────────────────┤
    │ Stage 4 — Budget < 0 (SLO violated)                                   │
    │   FREEZE persists. Postmortem triggers a remediation roadmap.         │
    │   Quarterly review with VP-Eng. Budget resets at window end.          │
    └───────────────────────────────────────────────────────────────────────┘

    # YAML policy artifact (illustrative pseudo-schema: OpenSLO-style header, Sloth-like alert windows; not valid input for either tool)
    apiVersion: openslo/v1
    kind: SLO
    metadata:
      name: checkout-availability
    spec:
      service: checkout
      indicator:
        ratio: { good: "http_requests_total{status!~'5..'}", total: "http_requests_total" }
      objectives:
        - target: 0.999
          op:     ">="
      timeWindow: 28d
      alerting:
        - name: fast-burn
          burnRate: 14.4
          longWindow:  1h
          shortWindow: 5m
          severity:    page
        - name: slow-burn
          burnRate: 6
          longWindow:  6h
          shortWindow: 30m
          severity:    page
      policy:
        budgetRemaining<20pct: { gate: feature-freeze, until: budget>=20pct }
        budgetRemaining<0:     { gate: feature-freeze, exec-review: true }
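
The burn-rate arithmetic and the multi-window rule above can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and the burn rates for each window are taken as already computed (in a real setup they would come from the monitoring system).

```python
def burn_rate(sli: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return (1.0 - sli) / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 28.0) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    return window_days / rate


def budget_consumed(rate: float, alert_hours: float, window_days: float = 28.0) -> float:
    """Fraction of the whole window's budget spent by burning at `rate` for `alert_hours`."""
    return rate * alert_hours / (window_days * 24.0)


def should_alert(long_rate: float, short_rate: float, threshold: float) -> bool:
    """Multi-window rule: the long AND the short window must both exceed the threshold."""
    return long_rate > threshold and short_rate > threshold


if __name__ == "__main__":
    rate = burn_rate(sli=0.994, slo=0.999)
    print(f"burn rate: {rate:.2f}x")
    print(f"days to exhaustion: {days_to_exhaustion(rate):.2f}")
    # Same fast-burn multiplier, two different SLO windows.
    print(f"14.4x for 1h, 30d window: {budget_consumed(14.4, 1, 30):.2%}")
    print(f"14.4x for 1h, 28d window: {budget_consumed(14.4, 1, 28):.2%}")
    # The long window is hot, but the short window has already recovered: no page.
    print(should_alert(long_rate=15.0, short_rate=2.0, threshold=14.4))
```

Requiring the short window as well means an alert stops firing soon after the burn stops, instead of staying red until the long window drains.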

Loaded when retrieval picks the atom as a focal / direct hit.

ErrorBudgetPolicy [pattern] v1.0.0

Examples

Google SRE: error-budget policies have been the canonical SRE primitive since 2003; documented in the SRE Book (Ch. 3) and the SRE Workbook (Ch. 2 and Appendix B).
Spotify Backstage SLO plugin: surfaces SLO + error budget per service in the developer portal; integrates with feature-freeze tooling.
Nobl9 / Sloth / Pyrra: tooling that generates burn-rate alerts and policy gates from declarative SLO specs.
Etsy / GitHub / Shopify all describe public versions of error-budget policies in engineering blogs.

Derived From

@community/rule-slo-required-for-prod

Compatible

@community/principle-three-pillars-observability

Source

where the compiled artifact came from

prime-system/examples/frontend-design/primes/compiled/@community/pattern-error-budget-policy/atom.yaml

Compiled at 2026-05-07