Service Level Objectives

Service Level Objectives (SLOs) translate "the service should be reliable enough" into a measurable target. Vigilo's SLO module computes error budget and burn rate continuously, surfaces them on dashboards, and fires multi-window alerts so on-call hears about a fast burn long before the budget is gone.

Overview

An SLO row defines a target percentage over a rolling window. For example, "99.9% availability over 30 days" gives roughly 43 minutes of budget per window. Every minute, the vigilo.slo.tick Celery beat task recomputes the burn for every active SLO from the underlying SLOEvent rows, updates cached metrics, and, if the burn crosses a configured threshold, dispatches a burn-rate alert.
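
The budget figure in the example above can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only, not Vigilo's implementation; the function name error_budget is hypothetical.

```python
from datetime import timedelta


def error_budget(target_percent: float, window_days: int = 30) -> timedelta:
    """Return the allowed 'bad' time for a target over a rolling window."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in the range (0, 100]")
    allowed_fraction = 1 - target_percent / 100
    return timedelta(days=window_days) * allowed_fraction


budget = error_budget(99.9, 30)
print(round(budget.total_seconds() / 60, 1))  # 43.2 (minutes)
```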

Vigilo's SLO computation is intentionally pragmatic rather than a full SRE platform. It covers the three SLO types operators reach for most often: availability, latency, and error rate. Custom SLO formulas are on the roadmap (see Troubleshooting) but are not in the current release.

Why it exists

Alerting on every individual probe failure produces fatigue. Alerting only on hard outages misses gradual erosion. Burn-rate alerting solves both problems: you alert when budget consumption over a short window is unusually high relative to the target, which catches both sudden spikes and slow leaks. The status page, the heatmap on the dashboard and the executive monthly report all consume the same SLO numbers, so leadership and on-call argue from the same data.

Key concepts

  • SLO fields — key (SLO-NNN auto-assigned), name, slo_type (availability, latency, error_rate, custom), target_percent (e.g. 99.9), window_days (default 30), linked_asset (optional FK), is_active, last_health (cached: healthy, at_risk, breached).
  • SLI — the Service Level Indicator is the raw metric. For availability it is the fraction of successful probes; for latency it is the fraction of probes under a threshold (configured on the SLO row); for error rate it is 1 - error_count/total.
  • Error budget — (1 - target) * window. A 99.9% target over 30 days yields 43.2 minutes. Vigilo expresses budget in absolute units (minutes, seconds, or requests) and as a percentage remaining.
  • Burn rate — how fast budget is being consumed relative to the rate at which it should be consumed. A burn rate of 1.0 over the full window means you will land exactly on the target; 5.0 means you are burning five times too fast.
  • Multi-window burn-rate alerts (WB.30) — Vigilo evaluates four windows in parallel: 1 h at 5×, 6 h at 5×, 24 h at 1×, 72 h at 1×. A fire requires both the short and the matching long window to be above the threshold, which suppresses single-spike false positives.
  • SLOEvent — a row per breach interval. Each event has started_at, ended_at, kind (unavailable, slow, errored), and source (probe_failure, external_metric). Burn is the sum of event durations inside the window divided by the budget.
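
The burn-rate and multi-window concepts above can be sketched as follows. This is an illustration under stated assumptions, not Vigilo's code: the Event class, the function names, and the treatment of "above the threshold" as greater-than-or-equal are all hypothetical, and events are assumed not to overlap each other. Burn rate is computed as the observed bad fraction in a lookback period divided by the allowed bad fraction; over the full SLO window this equals the sum of event durations divided by the budget.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Event:
    started_at: datetime
    ended_at: Optional[datetime] = None  # None means the breach is ongoing


def bad_time(events: List[Event], start: datetime, end: datetime) -> timedelta:
    """Sum the parts of each event that fall inside [start, end]."""
    total = timedelta()
    for ev in events:
        ev_end = end if ev.ended_at is None else ev.ended_at
        overlap = min(ev_end, end) - max(ev.started_at, start)
        if overlap > timedelta():
            total += overlap
    return total


def burn_rate(events, now, lookback: timedelta, target_percent: float) -> float:
    allowed = 1 - target_percent / 100
    if allowed <= 0:
        raise ValueError("burn rate is undefined for a 100% target")
    observed = bad_time(events, now - lookback, now) / lookback
    return observed / allowed


# (short window, long window, threshold) pairs taken from the list above.
PAIRS = {
    "fast": (timedelta(hours=1), timedelta(hours=6), 5.0),
    "slow": (timedelta(hours=24), timedelta(hours=72), 1.0),
}


def firing_alerts(events, now, target_percent: float) -> List[str]:
    """An alert fires only when both windows of a pair reach the threshold."""
    fired = []
    for name, (short, long_, threshold) in PAIRS.items():
        if (burn_rate(events, now, short, target_percent) >= threshold
                and burn_rate(events, now, long_, target_percent) >= threshold):
            fired.append(name)
    return fired


now = datetime(2024, 1, 1, 12, 0)
outage = [Event(now - timedelta(minutes=10), now)]
blip = [Event(now - timedelta(seconds=30), now)]
print(firing_alerts(outage, now, 99.9))  # ['fast', 'slow']
print(firing_alerts(blip, now, 99.9))    # []
```

The 30-second blip pushes the 1 h window over 5× but leaves the 6 h window well under it, so no alert fires. This is the single-spike suppression the paired windows are meant to provide.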

Common workflows

1. Create an availability SLO for a public endpoint

  1. Open Reliability → SLOs → + New SLO.
  2. Name: Checkout API availability.
  3. Type: availability.
  4. Target percent: 99.9.
  5. Window: 30 days.
  6. Linked asset: select the checkout-api Asset row (this lets the dependency map and the asset detail page surface the SLO health).
  7. Save. The card appears on the SLO dashboard with last_health = healthy and burn 0.
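
For readers who think in code, the form values in the steps above map onto the SLO fields listed under Key concepts. The dataclass below is only a sketch of that mapping with basic validation; the class name SLODraft and the validation rules are assumptions, not Vigilo's actual model.

```python
from dataclasses import dataclass
from typing import Optional

SLO_TYPES = {"availability", "latency", "error_rate", "custom"}


@dataclass
class SLODraft:
    name: str
    slo_type: str
    target_percent: float
    window_days: int = 30               # documented default
    linked_asset: Optional[str] = None  # optional link to an Asset
    is_active: bool = True
    last_health: str = "healthy"        # a new SLO starts out healthy

    def __post_init__(self):
        if self.slo_type not in SLO_TYPES:
            raise ValueError(f"unknown slo_type: {self.slo_type!r}")
        if not 0 < self.target_percent <= 100:
            raise ValueError("target_percent must be in the range (0, 100]")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")


draft = SLODraft(
    name="Checkout API availability",
    slo_type="availability",
    target_percent=99.9,
    linked_asset="checkout-api",
)
print(draft.window_days, draft.last_health)  # 30 healthy
```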

2. Read the burn-rate gauges on the dashboard

  1. Open Reliability → SLOs.
  2. Each SLO card has four mini gauges: 1 h, 6 h, 24 h, 72 h. The 1 h and 6 h gauges show fast-burn; the 24 h and 72 h gauges show slow-burn.
  3. A gauge turns amber at 0.8 of its threshold and red at the threshold. A red 1 h gauge paired with a red 6 h gauge triggers the fast multi-window alert; a red 24 h gauge paired with a red 72 h gauge triggers the slow one.
  4. Click any gauge to drill into the events that contributed to the burn, with a timeline view.
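
The colour and pairing rules in the steps above can be expressed compactly. This is a sketch only: the thresholds are the ones listed under Key concepts, while the function names and the "normal" label for a gauge below the amber level are assumptions.

```python
THRESHOLDS = {"1h": 5.0, "6h": 5.0, "24h": 1.0, "72h": 1.0}


def gauge_color(burn: float, threshold: float) -> str:
    """Red at the threshold, amber from 0.8 of it, otherwise normal."""
    if burn >= threshold:
        return "red"
    if burn >= 0.8 * threshold:
        return "amber"
    return "normal"


def card_state(burns: dict) -> dict:
    colors = {w: gauge_color(burns[w], t) for w, t in THRESHOLDS.items()}
    return {
        "colors": colors,
        # Each alert needs both gauges of its pair to be red.
        "fast_alert": colors["1h"] == "red" and colors["6h"] == "red",
        "slow_alert": colors["24h"] == "red" and colors["72h"] == "red",
    }


state = card_state({"1h": 6.2, "6h": 5.1, "24h": 0.85, "72h": 0.4})
print(state["colors"])      # {'1h': 'red', '6h': 'red', '24h': 'amber', '72h': 'normal'}
print(state["fast_alert"])  # True
print(state["slow_alert"])  # False
```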

3. Investigate a breached SLO

  1. The card shows Breach with a red border and a count of contributing events.
  2. Click the card to open the SLO detail page.
  3. The top chart shows minute-by-minute SLI value across the window. Drag-select a region to zoom.
  4. The Events table lists every SLOEvent in the window — start, end, duration, kind, source. Click an event to see the probe failures or external metric anomalies that triggered it.
  5. The right-hand panel shows linked incidents (when the SLO breach was severe enough to auto-open one) and any change requests deployed during the window.
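
To make the minute-by-minute chart in step 3 concrete, the sketch below buckets probe results by minute and computes an availability SLI (fraction of successful probes) for each bucket. The input shape, a list of (timestamp, ok) pairs, and the function name are assumptions for illustration; Vigilo's own aggregation may differ.

```python
from collections import defaultdict
from datetime import datetime


def sli_per_minute(probes):
    """Map each minute to the fraction of successful probes in that minute."""
    buckets = defaultdict(list)
    for ts, ok in probes:
        minute = ts.replace(second=0, microsecond=0)
        buckets[minute].append(ok)
    return {m: sum(oks) / len(oks) for m, oks in sorted(buckets.items())}


probes = [
    (datetime(2024, 1, 1, 9, 0, 5), True),
    (datetime(2024, 1, 1, 9, 0, 35), False),
    (datetime(2024, 1, 1, 9, 1, 5), True),
]
for minute, sli in sli_per_minute(probes).items():
    print(minute.strftime("%H:%M"), sli)
# 09:00 0.5
# 09:01 1.0
```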

4. Pause an SLO during planned maintenance

  1. Open the SLO detail.
  2. Click Settings → Add maintenance window and pick a start and end time. Maintenance windows exclude their interval from both budget and burn calculations.
  3. Active maintenance is also visible on the dashboard card as a small wrench icon.
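
One way to picture the exclusion described in step 2 is to subtract the maintenance overlap from any interval before it is counted, whether that interval is a breach event (burn) or the SLO window itself (budget). The helper below is a hypothetical sketch that assumes maintenance windows do not overlap each other; it is not Vigilo's implementation.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]


def overlap(a: Interval, b: Interval) -> timedelta:
    """Length of the intersection of two intervals (zero if disjoint)."""
    return max(min(a[1], b[1]) - max(a[0], b[0]), timedelta())


def duration_outside(interval: Interval, maintenance: List[Interval]) -> timedelta:
    """Duration of `interval` that is not covered by maintenance."""
    excluded = sum((overlap(interval, m) for m in maintenance), timedelta())
    return (interval[1] - interval[0]) - excluded


day = datetime(2024, 3, 1)
maintenance = [(day.replace(hour=10, minute=30), day.replace(hour=12))]
event = (day.replace(hour=10), day.replace(hour=11))

# Only the 30 minutes before maintenance started count towards burn.
print(duration_outside(event, maintenance))  # 0:30:00
```

Applying the same function to the full SLO window shortens the period the budget is computed over by the maintenance duration.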

5. Receive burn-rate alerts in Slack

  1. The dispatcher publishes burn events through the standard alert pipeline. Create an AlertRule with trigger slo_burn (visible only when SLOs are enabled) and target the SLO key.
  2. Pick the Slack webhook in Recipients.
  3. The fire payload includes both window values so Slack messages render "1 h burn 6.2×, 6 h burn 5.1× — Checkout API availability".
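
As an illustration of step 3, the sketch below turns a burn-alert payload into the message shown above and wraps it in the JSON body that Slack incoming webhooks accept (an object with a text field). The payload field names (slo_name, windows, label, burn) are assumptions; check the actual fire payload before relying on them.

```python
import json


def format_burn_message(payload: dict) -> str:
    """Render the short and long window burn values plus the SLO name."""
    short, long_ = payload["windows"]
    return (
        f"{short['label']} burn {short['burn']:.1f}×, "
        f"{long_['label']} burn {long_['burn']:.1f}× — {payload['slo_name']}"
    )


def slack_body(payload: dict) -> str:
    """JSON body for a Slack incoming webhook."""
    return json.dumps({"text": format_burn_message(payload)}, ensure_ascii=False)


payload = {
    "slo_name": "Checkout API availability",
    "windows": [
        {"label": "1 h", "burn": 6.2},
        {"label": "6 h", "burn": 5.1},
    ],
}
print(format_burn_message(payload))
# 1 h burn 6.2×, 6 h burn 5.1× — Checkout API availability
```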

Permissions

| Action | Roles |
|---|---|
| View SLOs | All workspace members |
| Create or edit SLO | Operator, Admin, Owner |
| Delete SLO | Admin, Owner |
| Add maintenance window | Operator, Admin, Owner |
| Configure burn-rate alert thresholds | Admin, Owner |

All SLO endpoints inherit WorkspaceScopedMixin. Even with shared asset names across workspaces, an SLO from workspace A is invisible to workspace B.
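
The two checks described in this section, role-based permission and workspace isolation, can be pictured with the plain-Python sketch below. It does not use or reproduce WorkspaceScopedMixin; the action keys, the SLO class, the "Viewer" role, and the helper names are all hypothetical.

```python
from dataclasses import dataclass

ANY_MEMBER = None  # marker: every workspace member may perform the action

ALLOWED_ROLES = {
    "view_slo": ANY_MEMBER,
    "edit_slo": {"Operator", "Admin", "Owner"},
    "delete_slo": {"Admin", "Owner"},
    "add_maintenance_window": {"Operator", "Admin", "Owner"},
    "configure_burn_thresholds": {"Admin", "Owner"},
}


@dataclass
class SLO:
    key: str
    workspace: str


def can(role: str, action: str) -> bool:
    allowed = ALLOWED_ROLES[action]
    return allowed is ANY_MEMBER or role in allowed


def visible_slos(slos, workspace: str):
    """Return only the SLOs that belong to the caller's workspace."""
    return [s for s in slos if s.workspace == workspace]


slos = [SLO("SLO-001", "A"), SLO("SLO-002", "B")]
print(can("Operator", "delete_slo"))            # False
print([s.key for s in visible_slos(slos, "B")])  # ['SLO-002']
```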

Troubleshooting

An SLO shows breached but the linked endpoint looks fine. Open the events table. The most common cause is stale SLOEvent rows still inside the rolling window from an outage earlier in the week. They will age out at the window boundary. Alternatively the SLI source may be a probe that times out for a non-availability reason — check the probe error.

Burn-rate gauges all read zero immediately after creating an SLO. SLO computation requires at least one full minute tick. Wait 60 seconds, then refresh. If the gauges stay at zero for longer than 5 minutes, check vigilo.slo.tick in the system health panel.

Latency SLO shows budget 0 from day one. The latency threshold (the cutoff that defines "slow") is probably not set. Edit the SLO and set a millisecond threshold; the budget recomputes on the next tick.

Multi-window alert fires but only the short window shows red. Check the threshold sliders in Settings → Burn-rate alerting. The slow-window threshold is configurable per workspace; a too-low slow threshold causes one-sided fires.

I want a custom formula across multiple metrics. A custom SLO builder is on the roadmap. Today, you can store an external SLI in SLOEvent directly via the API, then create an SLO with slo_type = custom and target_percent = 100 to surface it as a budget. The richer builder UI ships in a later release.

Related articles

  • Alert rules — wire burn-rate fires to email and webhooks.
  • Status page — auto-publish a public note when an SLO breaches.
  • Host monitoring — the probe data that feeds availability SLIs.