
On-call schedules

Overview

An OnCallSchedule defines who is responsible for responding to incidents at any given moment. Each schedule has a rotation cadence, an escalation path, and a series of OnCallSlot rows, each recording that a given user covers the period [start, end]. When an incident pages, the current slot's user is notified first; if they don't acknowledge within the configured delay, the page escalates to a backup.

Schedules live under People → On-call at /ws/{slug}/oncall. The page shows the current on-call for every schedule, the upcoming hand-offs for the next 14 days, and a "Currently on call" highlight at the top (T2.5) so the answer is readable at a glance.

Why it exists

Knowing who to page at 3am is a problem every team solves badly until they solve it. Vigilo's on-call model encodes the rotation in the database, exposes it through the incident-paging pipeline, and surfaces it on the homepage so engineers know they are on the hook before something breaks rather than after.

The model is intentionally simple — slots, primary, escalation — because complex on-call topologies belong in PagerDuty. Vigilo's job is to make the basic "who's up?" answer trivially available and to bridge to PagerDuty / Twilio when you need richer behavior.

Key concepts

  • OnCallSchedule — Workspace-scoped, named (e.g. "Platform — primary"). Has rotation (weekly, daily, custom), an escalation_delay_minutes (default 15), an escalation_to (single FK to UserProfile — the backup), and an is_active flag.
  • OnCallSlot — One concrete coverage period: schedule, user, start_time, end_time, is_override. Slots are generated by the rotation in bulk for the next N weeks; manual overrides set is_override=True so the next rotation sweep knows to preserve them.
  • Primary — Whoever's slot covers now(). The "Currently on call" chip pulls this in real time.
  • Escalation — Currently single-hop: if the primary doesn't acknowledge a page within escalation_delay_minutes, the page is re-sent to escalation_to. Multi-hop escalation chains are deferred (see Troubleshooting).
  • Acknowledgement — A page is acked either by adding a Slack reaction to the incident channel post (WD.7) or by replying ACK to the Twilio SMS. The ack writes an IncidentResponder row tagged with the ack source so the postmortem can attribute the response.
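
The slot and primary concepts above can be sketched in a few lines of Python. This is an illustration only, not Vigilo's actual model: the field names follow the OnCallSlot description, but the types, the half-open interval, and the rule that an override beats an overlapping rotation slot are all assumptions, and current_primary is a hypothetical helper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class OnCallSlot:
    # Field names come from the OnCallSlot description; the types are assumptions.
    schedule: str
    user: str
    start_time: datetime
    end_time: datetime
    is_override: bool = False


def current_primary(slots: list[OnCallSlot], now: datetime) -> Optional[str]:
    """Return the user whose slot covers `now`, or None when there is a gap."""
    # Assumption: slots are half-open, so a boundary instant belongs to one slot only.
    covering = [s for s in slots if s.start_time <= now < s.end_time]
    if not covering:
        return None
    # Assumption: an override beats a rotation slot; ties go to the later start.
    return max(covering, key=lambda s: (s.is_override, s.start_time)).user


week_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
slots = [
    OnCallSlot("platform", "alice", week_start, week_start + timedelta(days=7)),
    OnCallSlot("platform", "bob", week_start + timedelta(days=1),
               week_start + timedelta(days=2), is_override=True),
]
print(current_primary(slots, week_start + timedelta(hours=30)))  # bob
```

A None result corresponds to the "nobody is on call" state covered under Troubleshooting.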

Common workflows

Creating a schedule

  1. Go to People → On-call → New schedule.
  2. Pick a name, rotation cadence, and escalation backup. Set escalation_delay_minutes (default 15 is sensible).
  3. Add the rotation members in the order you want them to cycle (for a weekly rotation, member 1 covers week 1, member 2 covers week 2, and so on).
  4. Save. The system generates OnCallSlot rows for the next 8 weeks based on the rotation.
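
As a rough illustration of step 4, the sketch below generates eight weekly slots by cycling through the members in order. The Slot class and generate_weekly_slots are hypothetical names, and the sketch assumes a weekly cadence with back-to-back slots; the real generator may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Slot:
    user: str
    start_time: datetime
    end_time: datetime
    is_override: bool = False


def generate_weekly_slots(members: list[str], first_start: datetime,
                          weeks: int = 8) -> list[Slot]:
    """Cycle through `members` in order, producing one slot per week."""
    if not members:
        raise ValueError("a rotation needs at least one member")
    slots = []
    for week in range(weeks):
        start = first_start + timedelta(weeks=week)
        # Member 1 takes week 1, member 2 takes week 2, then the cycle repeats.
        user = members[week % len(members)]
        slots.append(Slot(user, start, start + timedelta(weeks=1)))
    return slots


rotation = generate_weekly_slots(
    ["alice", "bob", "carol"], datetime(2024, 1, 1, tzinfo=timezone.utc)
)
print([slot.user for slot in rotation])
```

With three members and eight weeks, the cycle wraps twice and ends partway through the third pass.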

Editing a slot (one-off override)

  1. From the schedule's calendar view, click the slot you want to change.
  2. Pick a new user or change the start/end. Save. The slot is stamped is_override=True so the next rotation regeneration doesn't overwrite it.
  3. The "Currently on call" chip and the paging pipeline pick up the new assignment immediately.
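
The is_override contract can be pictured as follows: a regeneration pass throws away previously generated slots but carries every override forward. This is a minimal sketch with hypothetical names (Slot, regenerate); it leaves overlapping slots in place and assumes the lookup prefers overrides, which may not match how Vigilo resolves overlaps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Slot:
    user: str
    start_time: datetime
    end_time: datetime
    is_override: bool = False


def regenerate(existing: list[Slot], fresh: list[Slot]) -> list[Slot]:
    """Replace generated slots with `fresh`, keeping every override."""
    # Only slots flagged is_override survive; unflagged slots are discarded.
    overrides = [s for s in existing if s.is_override]
    return sorted(overrides + list(fresh), key=lambda s: s.start_time)


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
existing = [
    Slot("alice", t0, t0 + timedelta(days=7)),
    Slot("dave", t0 + timedelta(days=1), t0 + timedelta(days=2), is_override=True),
]
fresh = [Slot("bob", t0, t0 + timedelta(days=7))]
merged = regenerate(existing, fresh)
print([(s.user, s.is_override) for s in merged])
```

A slot that was edited without the flag looks like any other generated slot to this kind of sweep, which is why such an edit would be lost.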

Handing off mid-rotation

  1. From the schedule, click Handoff. Pick the user you're handing off to and a start time (defaults to now).
  2. The current slot is end-stamped at the handoff time, a new override slot starts at that time for the new owner, and the rotation resumes its scheduled pattern at the next boundary.
  3. An oncall.handoff event fires through dispatch_event so the previous and next owners both get a Slack DM.
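
The slot arithmetic in step 2 can be sketched like this. Slot and hand_off are hypothetical names, and the sketch assumes the override runs until the end of the slot it splits, so the rotation resumes at the next boundary. The oncall.handoff event is left out because the signature of dispatch_event is not described here.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Slot:
    user: str
    start_time: datetime
    end_time: datetime
    is_override: bool = False


def hand_off(slots: list[Slot], new_user: str, at: datetime) -> list[Slot]:
    """End-stamp the slot covering `at` and add an override for `new_user`."""
    result, done = [], False
    for slot in slots:
        if not done and slot.start_time <= at < slot.end_time:
            if slot.start_time < at:
                # The previous owner keeps the part of the slot before the handoff.
                result.append(replace(slot, end_time=at))
            # The new owner covers the remainder, up to the next rotation boundary.
            result.append(Slot(new_user, at, slot.end_time, is_override=True))
            done = True
        else:
            result.append(slot)
    if not done:
        raise ValueError("no slot covers the handoff time")
    return result


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
schedule = [
    Slot("alice", t0, t0 + timedelta(days=7)),
    Slot("bob", t0 + timedelta(days=7), t0 + timedelta(days=14)),
]
after = hand_off(schedule, "carol", t0 + timedelta(days=3))
print([(s.user, s.is_override) for s in after])
```

The following week's slot is untouched, so the scheduled pattern continues once the override ends.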

Acknowledging a page

When an incident pages you, the Slack post in the incident channel has reaction emoji for ack and decline. React with the ack emoji to claim the incident. If you got an SMS, reply ACK (or DECLINE). Either action:

  • Writes an IncidentResponder row with role='responder' and joined_at=now().
  • Cancels the pending escalation timer so the backup isn't paged.
  • Posts a confirmation in the incident channel.

If you don't ack within escalation_delay_minutes, the page re-fires to the schedule's escalation_to user (single-hop only; see Troubleshooting).
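
The single-hop decision can be expressed as a small pure function. This is a sketch under stated assumptions, not the actual Celery task: Page and escalation_target are hypothetical names, and only escalation_to and escalation_delay_minutes (default 15) come from the schedule description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Page:
    primary: str
    sent_at: datetime
    acked_at: Optional[datetime] = None


def escalation_target(page: Page, escalation_to: Optional[str], now: datetime,
                      escalation_delay_minutes: int = 15) -> Optional[str]:
    """Return the backup to page, or None when no escalation is due."""
    if page.acked_at is not None:
        return None  # an ack cancels the pending escalation
    if escalation_to is None:
        return None  # no backup configured, so there is nobody to escalate to
    deadline = page.sent_at + timedelta(minutes=escalation_delay_minutes)
    return escalation_to if now >= deadline else None


sent = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
unacked = Page(primary="alice", sent_at=sent)
acked = Page(primary="alice", sent_at=sent, acked_at=sent + timedelta(minutes=4))
print(escalation_target(unacked, "bob", sent + timedelta(minutes=20)))  # bob
print(escalation_target(acked, "bob", sent + timedelta(minutes=20)))    # None
```

Because there is only one hop, the function returns at most one user; a chain would need a list of targets and per-hop delays.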

Permissions

  • Viewers can read the current and upcoming on-call.
  • Schedule members can edit their own slots (request swaps).
  • Admins / owners can create, edit, and delete schedules and override any slot.
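
The three tiers above can be read as two checks. The sketch below is illustrative only: the role strings and both function names are assumptions, not Vigilo's permission API.

```python
# Hypothetical role names; the workspace's real role identifiers may differ.
ADMIN_ROLES = {"admin", "owner"}


def can_manage_schedule(role: str) -> bool:
    """Creating, editing, and deleting schedules is limited to admins and owners."""
    return role in ADMIN_ROLES


def can_edit_slot(actor: str, role: str, slot_user: str,
                  schedule_members: set[str]) -> bool:
    """Admins and owners may override any slot; members only their own."""
    if role in ADMIN_ROLES:
        return True
    return actor in schedule_members and actor == slot_user


members = {"alice", "bob"}
print(can_edit_slot("alice", "member", "alice", members))  # True
print(can_edit_slot("alice", "member", "bob", members))    # False
```

Viewers never pass either check, which matches their read-only access.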

Troubleshooting

The "Currently on call" chip says nobody is on call — Either the schedule has no slot covering now() (the rotation hasn't been extended past the end of the last generated slot), or is_active=False. Open the schedule and click Extend rotation or flip it active.

The page didn't escalate — Check the schedule's escalation_to is set, the user is still an active workspace member, and escalation_delay_minutes is sane. The escalation Celery task logs the decision in the incident timeline.

Multi-hop escalation (primary → secondary → tertiary) — Not yet supported in Vigilo's native scheduler (deferred per audit T3.9). For multi-hop chains, configure the topology in PagerDuty and let Vigilo's PagerDuty integration drive paging. Vigilo's single-hop schedule still serves as the workspace's "who's up?" view.

My SMS ack didn't register — Twilio inbound webhooks need to point at /integrations/twilio_sms/. Check Settings → Integrations → Twilio for the inbound URL and the last delivery log. Slack reactions are more reliable.

A rotation regeneration overwrote my override — Confirm the slot had is_override=True. The regeneration job preserves any slot with that flag. If the flag was missing (for example, because the slot was created manually through the admin), re-create the override from the UI, which sets the flag correctly.

Handoff posted but the next slot didn't update — Race condition between the manual edit and the rotation extender. Reload the page; if the slot still looks wrong, click Regenerate from here.

Related