Incident lifecycle

Overview

An incident in Vigilo is "something went wrong and we're tracking the response in one place." The lifecycle is intentionally short (five states on the main path, plus a single false-alarm exit from open) because in the middle of a P1 you don't have the cognitive budget to navigate a fancy state machine. The main path is: open → investigating → mitigated → resolved → postmortem.

Incidents glue together the rest of Vigilo: they link to the change that caused them (if any), the host or service that's affected, the on-call rota that owns the response, the status-page incident that's visible to customers, and — once it's all over — the postmortem document that captures the learning.

Why it exists

The proximate cause of most "we didn't respond well" retros is friction between tools. Your alerting is in PagerDuty, your war room is in Slack, your status page is in Statuspage.io, your runbook is in Confluence, your customer comms are in Intercom, and your blameless postmortem template is in Google Docs. Vigilo absorbs the parts that benefit most from being in your ITSM (the timeline, the on-call assignment, the change link, the status-page publish, the postmortem) and leaves the chat in Slack where it belongs.

Key concepts

Incident — Title, description, severity (sev1 … sev5), affected services, on-call assignee, escalation chain, opened-at, current state. Owned by a workspace.
State — open, investigating, mitigated, resolved, postmortem. Each transition writes an audit row + fires incident.<verb> events.
Severity — sev1 (full outage, customer-impacting), sev2 (major degradation), sev3 (degraded, not user-visible yet), sev4 (internal only), sev5 (informational). Drives auto-page and status-page behaviour.
On-call rota — A rotation schedule that resolves to a single user at any given time. Set in Incidents → On-call → schedules. Acknowledgement, escalation, and override are all logged.
Escalation chain — Ordered list of (rota, delay) tuples. If the primary doesn't ack within the delay, the next rota in the chain is paged.
Status-page incident — A user-facing record published to a public status page. Auto-created (T1.8) for sev1 and sev2, manual for everything else.
Storm aggregation (WD.8) — When many alerts fire on the same affected service within a window, Vigilo collapses them into a single parent incident and links the alerts to it. Prevents 200 emails for one underlying outage.
Postmortem — A collaborative markdown document attached to the incident. Has its own state (draft → review → published) and writes a final audit entry on published.

The state diagram

open ──assign──► investigating ──mitigate──► mitigated ──resolve──► resolved ──postmortem──► postmortem
  │
  └──── cancel (false alarm) ──► cancelled

open is the only state from which an incident can be declined as a false alarm ("not actually an incident"), which moves it to the terminal cancelled state. Once you've moved to investigating, the only way out is forward through resolved.
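The transitions in the diagram can be sketched as a lookup table. This is a minimal illustration, not Vigilo's implementation: the `Incident` class, the `apply` method, and the in-memory audit list are all hypothetical names, and the action labels are taken from the diagram as drawn.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Allowed transitions, copied from the diagram: (from_state, action) -> to_state.
TRANSITIONS = {
    ("open", "assign"): "investigating",
    ("open", "cancel"): "cancelled",
    ("investigating", "mitigate"): "mitigated",
    ("mitigated", "resolve"): "resolved",
    ("resolved", "postmortem"): "postmortem",
}


class InvalidTransition(Exception):
    """Raised when an action is not allowed from the current state."""


@dataclass
class Incident:
    title: str
    state: str = "open"
    audit: list = field(default_factory=list)  # hypothetical in-memory audit log

    def apply(self, action: str, actor: str) -> str:
        """Move to the next state, or raise if the diagram has no such edge."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise InvalidTransition(f"{action!r} is not allowed from {self.state!r}")
        previous, self.state = self.state, TRANSITIONS[key]
        # One audit row per transition, as described under Key concepts.
        self.audit.append({
            "at": datetime.now(timezone.utc),
            "actor": actor,
            "from": previous,
            "to": self.state,
            "action": action,
        })
        return self.state
```

Because `cancel` only appears in the table with `open` as its source, any attempt to cancel after acknowledgement fails the lookup, which is the "only way out is forward" rule expressed as data.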

Common workflows

1. Open an incident

Three paths:

Auto-open — An alert rule with severity sev1 or sev2 flips true. Vigilo opens an incident, attaches the alert as the first timeline entry, and pages the on-call.
From a change — A change in in_progress is marked Failed or Rolled-back. Vigilo opens a linked incident automatically for risk = high changes; the link is bi-directional and surfaces on both records.
Manual — Sidebar → Incidents → New incident. Pick severity, title, affected services, optional on-call override.

In all three cases the incident's open event fires, and the status page publishes if the severity is sev1 or sev2.
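The opening rules above reduce to two small predicates. The sketch below is illustrative only: the function names and the string encodings (`"failed"`, `"rolled_back"`, `"high"`) are assumptions, and it ignores any workspace-level status-page setting.

```python
# Severities that auto-page and auto-publish, per the Severity and
# Status-page incident definitions.
AUTO_SEVERITIES = {"sev1", "sev2"}


def should_auto_publish(severity: str, affected_services: list[str]) -> bool:
    """Status page auto-publishes only for sev1/sev2 with services listed."""
    return severity in AUTO_SEVERITIES and len(affected_services) > 0


def should_open_from_change(outcome: str, risk: str) -> bool:
    """A failed or rolled-back change opens a linked incident only when risk is high."""
    # "failed" / "rolled_back" / "high" are hypothetical string encodings.
    return outcome in {"failed", "rolled_back"} and risk == "high"
```

The empty-services check mirrors the troubleshooting entry for a sev1 that did not publish.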

2. Acknowledge + investigate

The on-call user gets paged (in-app + integration if configured). They click Ack → state moves to investigating. Their identity is recorded as the responder.

The timeline becomes the war-room log: paste links, add notes, attach the relevant logs/dashboards. Every comment also writes an audit row.

3. Mitigate

Once the user-visible impact is over (even if the root cause isn't fixed) → Mark mitigated. State moves to mitigated. The status-page incident updates to "monitoring."

This is an intentionally separate state: distinguishing "they can buy things again" from "we understand why it broke" is one of the cheapest reliability wins available.

4. Resolve

When the root cause is fixed and you're confident it won't recur → Resolve. State moves to resolved, the status-page incident closes, the MTTR clock stops. The incident.resolved event fires; playbooks subscribed to it can auto-create a Jira ticket for the fix-forward work, post in Slack, etc.
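To make the resolve step concrete, here is a rough sketch of a subscriber reacting to incident.resolved and of the time-to-resolve calculation. The event bus (`subscribe`, `publish`), the payload shape, and the choice of opened-at as the start of the MTTR clock are assumptions for illustration; Vigilo's actual playbook API is not shown in this page.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical in-process event bus, standing in for playbook subscriptions.
_handlers = defaultdict(list)


def subscribe(event_name, handler):
    """Register a callable to run when event_name is published."""
    _handlers[event_name].append(handler)


def publish(event_name, payload):
    """Call every subscriber with the payload and collect their results."""
    return [handler(payload) for handler in _handlers[event_name]]


def time_to_resolve(opened_at: datetime, resolved_at: datetime) -> timedelta:
    """Elapsed time from open to resolve (assumed start of the MTTR clock)."""
    return resolved_at - opened_at


def on_resolved(payload):
    # Stand-in for "create a fix-forward ticket" or "post in Slack".
    return f"follow-up ticket for incident {payload['id']}"


subscribe("incident.resolved", on_resolved)
```

Calling `publish("incident.resolved", {"id": "INC-1"})` runs the registered handler; events with no subscribers simply return an empty list.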

5. Postmortem

For sev1 and sev2, Vigilo prompts the on-call to draft a postmortem within 48 hours (configurable). Click Start postmortem → a markdown editor opens with the workspace's postmortem template pre-filled (timeline, impact, root cause, contributing factors, action items, action item owners).

The postmortem has its own approval flow: draft → review → published. Published postmortems are indexed in the knowledge base and surface as related links on future incidents touching the same services.
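The reminder window and the approval flow can be sketched as follows. This assumes the 48-hour window is measured from resolution, which the text above does not state; the function names are hypothetical.

```python
from datetime import datetime, timedelta

POSTMORTEM_SEVERITIES = {"sev1", "sev2"}
POSTMORTEM_STATES = ["draft", "review", "published"]


def postmortem_due(severity: str, resolved_at: datetime, window_hours: int = 48):
    """Return the draft deadline, or None when no prompt is expected."""
    if severity not in POSTMORTEM_SEVERITIES:
        return None
    # Assumption: the configurable window starts when the incident resolves.
    return resolved_at + timedelta(hours=window_hours)


def next_postmortem_state(current: str) -> str:
    """Advance draft -> review -> published; published is terminal."""
    index = POSTMORTEM_STATES.index(current)  # ValueError on unknown states
    if index == len(POSTMORTEM_STATES) - 1:
        raise ValueError("published is the final postmortem state")
    return POSTMORTEM_STATES[index + 1]
```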

Escalation chain

Configured in Incidents → On-call → escalation chains. A chain is per-severity (and optionally per-service):

Step	Rota	Delay
1	Primary SRE	0 min
2	Backup SRE	10 min
3	SRE manager	20 min
4	Engineering VP	45 min

Delays are cumulative from the first page. If step 1 doesn't ack within 10 min, step 2 is paged alongside step 1. If nobody has acked after another 10 min, step 3 is paged, and step 4 follows at the 45-minute mark. Each escalation writes to the timeline.
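As a worked example of the table, the sketch below computes which rotas have been paged after a given number of minutes without an acknowledgement. It assumes each delay is measured from the first page (which is what the 0/10/20/45 values suggest) and that earlier steps stay paged; `Step` and `paged_rotas` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    rota: str
    delay_min: int  # minutes after the first page, as in the table


CHAIN = [
    Step("Primary SRE", 0),
    Step("Backup SRE", 10),
    Step("SRE manager", 20),
    Step("Engineering VP", 45),
]


def paged_rotas(chain, minutes_unacked):
    """Rotas paged so far, given minutes elapsed with no acknowledgement."""
    return [step.rota for step in chain if step.delay_min <= minutes_unacked]
```

With this chain, `paged_rotas(CHAIN, 25)` includes the primary, the backup, and the SRE manager, but not yet the Engineering VP.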

Storm aggregation (WD.8)

When a single underlying outage triggers many alerts (e.g. a load balancer dies and 47 hosts behind it stop responding), Vigilo's storm detector groups them. Alerts are merged into a single parent incident when they meet all of these rules:

They fire within a configurable window (default 60 s).
They share at least one affected asset or asset tag.
Their count is above a configurable threshold (default 5).

Child alerts link back to the parent. Resolving the parent auto-resolves the children when their conditions clear.
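The three rules can be sketched as a sliding window over alerts grouped by asset. This is one plausible reading, not Vigilo's detector: `Alert` and `detect_storms` are hypothetical names, and "above" the threshold is read literally as strictly more than it (swap `>` for `>=` if the threshold is inclusive).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    id: str
    fired_at: float    # seconds since epoch
    assets: frozenset  # affected assets or asset tags


def detect_storms(alerts, window_s=60.0, threshold=5):
    """Return {asset: [alert ids]} for assets whose alerts qualify as a storm."""
    by_asset = defaultdict(list)
    for alert in alerts:
        for asset in alert.assets:
            by_asset[asset].append(alert)

    storms = {}
    for asset, group in by_asset.items():
        group.sort(key=lambda a: a.fired_at)
        best = []
        start = 0
        for end in range(len(group)):
            # Shrink the window until it spans at most window_s seconds.
            while group[end].fired_at - group[start].fired_at > window_s:
                start += 1
            if end - start + 1 > len(best):
                best = group[start:end + 1]
        # Rule 3: the count must be above the threshold.
        if len(best) > threshold:
            storms[asset] = [a.id for a in best]
    return storms
```

For the load-balancer example, 47 alerts tagged with the same asset and fired a second apart all land in one window and come back as a single storm, while three unrelated alerts on another asset are left alone.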

Permissions & gating

Action	Roles allowed
View incidents	All workspace members
Open an incident manually	`member`, `approver`, `admin`, `owner`
Acknowledge	The on-call user OR any `admin`/`owner`
Mitigate / resolve	The responder OR any `admin`/`owner`
Cancel (false alarm)	`admin`, `owner`
Edit on-call rotas	`admin`, `owner`
Publish postmortem	`admin`, `owner`
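The table above maps onto a single check function. The sketch is illustrative: the action names, the `can` function, and the `viewer` role used in the usage note are hypothetical; only the four role names in the table come from this page.

```python
ADMIN_ROLES = {"admin", "owner"}
MANUAL_OPEN_ROLES = {"member", "approver", "admin", "owner"}
ADMIN_ONLY_ACTIONS = {"cancel", "edit_rotas", "publish_postmortem"}


def can(action: str, role: str, *, is_on_call: bool = False,
        is_responder: bool = False) -> bool:
    """Return True when the table allows this action for this user."""
    if action == "view":
        return True  # all workspace members
    if action == "open":
        return role in MANUAL_OPEN_ROLES
    if action == "acknowledge":
        return is_on_call or role in ADMIN_ROLES
    if action in {"mitigate", "resolve"}:
        return is_responder or role in ADMIN_ROLES
    if action in ADMIN_ONLY_ACTIONS:
        return role in ADMIN_ROLES
    raise ValueError(f"unknown action: {action!r}")
```

For example, `can("resolve", "member", is_responder=True)` is allowed, while `can("cancel", "member")` is not, and a hypothetical read-only `viewer` role could view incidents but not open one.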

Troubleshooting

"On-call didn't get paged." — Check the rota is active and the integration (PagerDuty, Opsgenie, webhook) is delivering. Webhooks panel under Integrations shows the delivery audit.
"Status page didn't publish for a sev1." — Either status-page publishing is switched off for the workspace, or the incident's affected services field is empty. Open the incident → ensure services are listed → click "Publish manually."
"Two incidents for the same outage." — Storm aggregation didn't catch them — they don't share an affected asset. Merge them: open the duplicate → Merge into → pick the parent. The duplicate becomes a child.
"Postmortem reminder fired but the incident is closed." — Reminder cadence is configurable per workspace. Settings → Incidents → Postmortem cadence.
"Audit log shows a transition I didn't authorise." — Look at the actor column — it may have been a Celery task (e.g. auto-resolve on alert clear) or a playbook. Both are valid actors and labelled as such.