Analytics and DORA metrics

Overview

The Analytics module turns Vigilo's operational ledger — changes, incidents, approvals, runbook executions — into the four DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, Mean Time to Resolve) plus a library of derived service- and team-level views. It lives at /ws/{slug}/analytics and renders the DoraPage with tabs for Overview, By team, By service, By change type, and Goals.

Because every CHG and INC is already a first-class object in the workspace, DORA scoring is deterministic from the same database the rest of the product uses — no separate event pipeline, no third-party telemetry collector, and no opportunity for the four numbers to disagree with the underlying records.

Why it exists

DORA metrics are the de-facto language for measuring delivery and reliability performance. Most organisations stitch them together from build systems (deploy frequency), ticket systems (lead time), incident tools (MTTR), and Excel (failure rate), with no shared definition or audit trail. Vigilo computes all four from a single source of truth — the ChangeRequest and Incident tables — so the dashboard you stare at in the morning and the export your auditor sees come from the same query.

Key concepts

DoraMetricsApi

The endpoint GET /ws/{slug}/api/v1/analytics/dora/ returns a structured payload with the four metrics over a configurable window (default: the last 30 days). It accepts the query parameters start, end, group_by, and filter.

Response shape (simplified):

{
  "window": {"start": "...", "end": "..."},
  "deploy_frequency_per_day": 4.2,
  "lead_time_hours": {"p50": 18.0, "p90": 72.0},
  "change_failure_rate": 0.07,
  "mttr_hours": {"p50": 1.5, "p90": 6.4}
}
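
A minimal Python sketch of building a request URL for this endpoint and reading the payload. The host name, the workspace slug "acme", and the helper names are hypothetical; the format expected by start and end is not specified here, so ISO dates are an assumption. Authentication and the HTTP call itself are left out.

```python
import json
from urllib.parse import urlencode


def build_dora_url(base_url, slug, start=None, end=None, group_by=None):
    """Build the DORA endpoint URL; parameters left as None are omitted."""
    params = {"start": start, "end": end, "group_by": group_by}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url.rstrip('/')}/ws/{slug}/api/v1/analytics/dora/"
    return f"{url}?{query}" if query else url


def summarize(payload):
    """Format the four headline numbers from an ungrouped payload."""
    return (
        f"{payload['deploy_frequency_per_day']} deploys/day, "
        f"lead time p50 {payload['lead_time_hours']['p50']}h, "
        f"failure rate {payload['change_failure_rate']:.1%}, "
        f"MTTR p50 {payload['mttr_hours']['p50']}h"
    )


# Sample body copied from the simplified response shape above.
SAMPLE = json.loads("""
{
  "window": {"start": "...", "end": "..."},
  "deploy_frequency_per_day": 4.2,
  "lead_time_hours": {"p50": 18.0, "p90": 72.0},
  "change_failure_rate": 0.07,
  "mttr_hours": {"p50": 1.5, "p90": 6.4}
}
""")

# Hypothetical host and slug, for illustration only.
print(build_dora_url("https://vigilo.example.com", "acme", group_by="team"))
print(summarize(SAMPLE))
```

The URL builder omits unset parameters so that the server-side defaults (such as the 30-day window) still apply.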

Metric definitions

deploy_frequency_per_day — Count of ChangeRequest rows with closure_code='successful' (or successful_with_issues) completed in the window, divided by window length in days.
lead_time_hours — Per-change duration from created_at to completed_at for successfully closed changes; reported as p50 and p90.
change_failure_rate — Count of changes with closure_code='unsuccessful' OR FSM state rolled_back OR with a linked incident opened within 24 hours of completion, divided by total closed changes in the window.
mttr_hours — Per-incident duration from opened_at to resolved_at for incidents resolved in the window; reported as p50 and p90.

The 24-hour incident-link window for failure rate is configurable on Workspace.settings.dora_failure_window_hours.
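
The change-based definitions above can be sketched in Python over in-memory records. This is an illustration, not Vigilo's actual query: the Change dataclass and its field layout are hypothetical, the nearest-rank percentile method is an assumption (the percentile method is not specified here), and a linked incident is assumed to count only when it opens at or after completion. mttr_hours would follow the same percentile pattern over opened_at and resolved_at.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import ceil
from typing import Optional

SUCCESS_CODES = {"successful", "successful_with_issues"}


@dataclass
class Change:  # hypothetical stand-in for a ChangeRequest row
    created_at: datetime
    completed_at: datetime
    closure_code: str
    rolled_back: bool = False
    linked_incident_opened_at: Optional[datetime] = None


def percentile(values, pct):
    """Nearest-rank percentile (assumed method)."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def is_failure(change, failure_window_hours=24):
    """Unsuccessful, rolled back, or followed by a linked incident in the window."""
    if change.closure_code == "unsuccessful" or change.rolled_back:
        return True
    opened = change.linked_incident_opened_at
    if opened is None:
        return False
    delay = opened - change.completed_at
    return timedelta(0) <= delay <= timedelta(hours=failure_window_hours)


def dora_change_metrics(changes, window_days, failure_window_hours=24):
    """Compute the three change-based metrics for changes closed in the window."""
    successes = [c for c in changes if c.closure_code in SUCCESS_CODES]
    lead_times = [
        (c.completed_at - c.created_at).total_seconds() / 3600 for c in successes
    ]
    failures = sum(is_failure(c, failure_window_hours) for c in changes)
    return {
        "deploy_frequency_per_day": len(successes) / window_days,
        "lead_time_hours": (
            {"p50": percentile(lead_times, 50), "p90": percentile(lead_times, 90)}
            if lead_times
            else None
        ),
        "change_failure_rate": failures / len(changes) if changes else None,
    }
```

Note that under these definitions a change can count toward deploy frequency and toward the failure rate at the same time, for example when it closes as successful but a linked incident opens a few hours later.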

Grouping

?group_by=team|service|change_type returns the metrics computed per group instead of workspace-wide (WA.10). The response becomes an array of {key, ...metrics} rows. The UI uses this on the By team / By service / By change type tabs to render small-multiples — one mini-card per group with a sparkline.

team group key — WorkspaceMembership.team (if set).
service group key — Asset.id for the primary affected_asset on the change (or incident.service).
change_type group key — ChangeRequest.change_type (standard | normal | emergency).
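
A short Python sketch of working with a grouped response, assuming each row carries the same metric fields as the ungrouped payload alongside its key. The helper name and the sample rows are hypothetical and only illustrate the {key, ...metrics} shape.

```python
def worst_groups(rows, metric="change_failure_rate", top=3):
    """Return the keys of the groups with the highest value of `metric`.

    Rows where the metric is missing or None are skipped.
    """
    scored = [r for r in rows if r.get(metric) is not None]
    ranked = sorted(scored, key=lambda r: r[metric], reverse=True)
    return [r["key"] for r in ranked[:top]]


# Illustrative rows only; real keys depend on the group_by value.
rows = [
    {"key": "payments", "change_failure_rate": 0.12, "deploy_frequency_per_day": 1.5},
    {"key": "search", "change_failure_rate": 0.03, "deploy_frequency_per_day": 6.0},
    {"key": "platform", "change_failure_rate": None, "deploy_frequency_per_day": 0.0},
]

print(worst_groups(rows, top=2))
```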

Regression alerts

A scheduled Celery task (WB.20) runs nightly and compares the rolling 7-day metric values to the rolling 28-day baseline per group. If a metric regresses by more than the threshold (defaults: deploy_frequency down 30%, lead_time up 30%, failure_rate up 50%, mttr up 30%) the system creates a MetricRegressionAlert row and dispatches the metric.regression webhook event. Subscribed channels (Slack, email) receive a formatted summary with a link to the analytics tab.

Thresholds are tunable per workspace under Settings → Analytics → Regression thresholds. Acknowledging an alert silences re-fires for the same metric+group for 24 hours.
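
The comparison performed by the nightly task can be sketched as follows. This is a minimal illustration, assuming that "regresses by more than the threshold" means relative change against the 28-day baseline; the function name, the dictionary layout, and the handling of a missing or zero baseline are assumptions, not the task's actual implementation.

```python
# Default thresholds from the text: (direction that counts as worse, relative limit).
DEFAULT_THRESHOLDS = {
    "deploy_frequency": ("down", 0.30),
    "lead_time": ("up", 0.30),
    "failure_rate": ("up", 0.50),
    "mttr": ("up", 0.30),
}


def find_regressions(current, baseline, thresholds=DEFAULT_THRESHOLDS):
    """Return {metric: relative_change} for metrics that worsened past their limit.

    `current` holds rolling 7-day values, `baseline` rolling 28-day values.
    """
    regressions = {}
    for metric, (direction, limit) in thresholds.items():
        base = baseline.get(metric)
        cur = current.get(metric)
        if not base or cur is None:
            continue  # assumed: skip when there is no usable baseline
        change = (cur - base) / base
        worsened = -change if direction == "down" else change
        if worsened > limit:
            regressions[metric] = change
    return regressions
```

In this sketch a deploy frequency that halves is reported as -0.5, while an MTTR that rises from 2.0 to 2.9 hours is reported as roughly +0.45, matching the "+45% vs 28-day baseline" style of the alert message.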

DORA goals

Each workspace can declare quarterly goals, such as deploy_frequency_per_day ≥ 5, lead_time_p50 ≤ 24h, change_failure_rate ≤ 0.10, and mttr_p50 ≤ 2h. Goals are stored in Workspace.settings.dora_goals and render as a target line on the trend charts. Goals are not enforced (the workspace will not refuse changes that breach them), but breaches surface a yellow chip on the DORA card and feed into the executive summary widget.
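
A goal check of this kind can be sketched in a few lines of Python. The storage format of Workspace.settings.dora_goals is not described here, so the GOALS mapping below (comparison operator plus target) is a hypothetical layout that mirrors the goal expressions quoted above.

```python
import operator

# Hypothetical layout: metric name -> (comparison that means "goal met", target).
GOALS = {
    "deploy_frequency_per_day": (operator.ge, 5),
    "lead_time_p50": (operator.le, 24),      # hours
    "change_failure_rate": (operator.le, 0.10),
    "mttr_p50": (operator.le, 2),            # hours
}


def breached_goals(actuals, goals=GOALS):
    """Return the names of goals whose current value misses the target.

    Metrics absent from `actuals` are ignored rather than treated as breaches.
    """
    return [
        name
        for name, (meets, target) in goals.items()
        if name in actuals and not meets(actuals[name], target)
    ]


actuals = {
    "deploy_frequency_per_day": 4.2,
    "lead_time_p50": 18.0,
    "change_failure_rate": 0.07,
    "mttr_p50": 1.5,
}
print(breached_goals(actuals))
```

Storing the comparison direction with each target keeps "higher is better" metrics (deploy frequency) and "lower is better" metrics (the other three) in one table.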

DoraPage UI tabs

Overview — the four headline numbers, each with a 7-day sparkline and the goal line.
By team — small-multiples by team.
By service — small-multiples by service (driven by affected_assets).
By change type — segmented by standard | normal | emergency.
Goals — read-only summary of current goals, last quarter result, and an admin-only edit button.

Common workflows

Read the current DORA score

Open /ws/{slug}/analytics. The Overview tab loads with the default 30-day window.
Each card shows the current value, the absolute change vs the previous window, and the goal target line.
Hover the sparkline to see daily values.

Investigate a regression alert

The Slack channel pings: "MTTR (p50) for team payments regressed +45% vs 28-day baseline."
Click through to /ws/{slug}/analytics?group_by=team&team=payments.
Inspect the MTTR sparkline — the spike usually corresponds to one or two long incidents. Hover the spikes to deep-link to the INC numbers.
Acknowledge the alert (button on the alert detail) once investigated.

Set quarterly goals

Admin opens Settings → Analytics → Goals.
Set the four targets and the quarter end date. Click Save.
The goal line appears on every DORA chart.

Export

DORA payloads are exportable as CSV or JSON from the Overview tab action menu. The export honours current filters and grouping. To schedule the same export, see Saved and scheduled reports.

Permissions

Viewers can read all analytics views and the DORA card.
Engineers / approvers can acknowledge regression alerts.
Admins / owners can edit goals, edit regression thresholds, and disable specific groups.

Troubleshooting

DORA card shows "—" for change_failure_rate — Either no changes closed in the window, or every change in the window closed with closure_code='successful' and none has a linked incident within the failure window. The card displays a sample-size badge so you can distinguish "zero failures" from "insufficient data".

Regression alert keeps re-firing — The 24-hour suppression only blocks re-fires for the same metric and group. If multiple groups regress, each gets its own alert. If the suppression is genuinely broken, check the Celery beat log for the evaluate_metric_regressions task.

Numbers don't match my external tool — Likely a definition mismatch. Vigilo's deploy_frequency counts successful changes only (mirroring DORA's "successful deploy" definition), whereas many third-party tools count all attempts. To bring the numbers closer, adjust dora_failure_window_hours (this affects change_failure_rate only) or apply filters to the API request directly.

Goal line missing from sparkline — No goal set for that metric, or the goal expired (past the quarter end date). Set a new goal under Settings → Analytics → Goals.