Freeze and maintenance windows

Overview

Vigilo distinguishes two superficially-similar concepts that solve different problems:

  • A FreezeWindow blocks change requests from being submitted or scheduled during its interval.
  • A MaintenanceWindow suppresses monitoring alerts (and optionally auto-approves standard changes) during its interval.

The same week of the year might be both — a holiday code freeze plus a planned datacenter maintenance — but the two are configured separately because they protect against different failure modes. Freeze windows protect against humans deploying when they shouldn't; maintenance windows protect against humans being paged when the system is intentionally noisy.

Both live under Settings → Maintenance at /ws/{slug}/settings/maintenance and ship with their own forms, calendar overlay, and audit history.

Why it exists

Every operations team has a Black Friday, a year-end close, a tax-filing deadline. Trying to enforce "no changes this week" by emailing the engineering list is how outages happen. The FreezeWindow model encodes the policy in the FSM so the system, not a human, says no — and records the override when leadership says yes.

Maintenance windows solve the inverse: when you've taken a service down deliberately, you don't want to wake the on-call. Suppressing alerts for the planned interval (and only that interval) keeps the monitoring signal honest.

Key concepts

FreezeWindow

  • start_time / end_time — The closed interval [start, end] during which submission and scheduling are blocked.
  • allow_override — When True, admins and owners can push a change through with a justification. When False, this is a hard lockout — no one can override (typical for P0 incident lockouts).
  • applies_to_priorities — Optional list. Empty means "applies to all changes"; populated means "only blocks changes whose priority is in this list". A common pattern is to freeze ["low", "medium"] during a release week so emergency hotfixes still flow.
  • applies_to_change_types — Optional list of change types the freeze applies to. Empty means all types. A common configuration lists every type except emergency, so emergency changes still go through.
  • reason — Free-text shown in the block error and the audit log. Required by convention.
  • is_active — Soft toggle. An inactive freeze never blocks regardless of dates.
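The scoping rules above can be sketched as a small matching function. This is a minimal illustration, not Vigilo's actual model: the class name FreezeWindowSketch and the blocks() method are hypothetical, and the sketch assumes that when both scope lists are populated a change must match both to be blocked.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class FreezeWindowSketch:
    """Hypothetical stand-in for the FreezeWindow fields described above."""
    start_time: datetime
    end_time: datetime
    reason: str = ""
    allow_override: bool = True
    is_active: bool = True
    applies_to_priorities: list = field(default_factory=list)
    applies_to_change_types: list = field(default_factory=list)

    def blocks(self, planned_start, planned_end, priority, change_type):
        """Return True if this freeze would block the described change."""
        if not self.is_active:
            return False  # an inactive freeze never blocks
        # Closed-interval overlap: touching endpoints count as overlapping.
        if planned_end < self.start_time or planned_start > self.end_time:
            return False
        # An empty scope list means "applies to everything".
        if self.applies_to_priorities and priority not in self.applies_to_priorities:
            return False
        # Assumption: both scope lists must match when both are populated.
        if self.applies_to_change_types and change_type not in self.applies_to_change_types:
            return False
        return True
```

With applies_to_priorities set to ["low", "medium"], a low-priority change inside the interval is blocked, while a change with any other priority in the same slot is not.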

MaintenanceWindow

  • start_time / end_time — The interval during which monitoring alerts are suppressed.
  • affected_hosts — List of host UUIDs the window applies to. Empty means all hosts in the workspace.
  • suppress_ssl_alerts — When True (default), SSL/certificate alerts emitted for affected_hosts during the window are dropped.
  • auto_approve_standard — When True, standard changes whose planned window falls inside this maintenance window auto-approve on submit.
  • recurrence — One of none, weekly, monthly. Recurring windows are evaluated by covers(when) against the current period.
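One way to picture how a recurring window is evaluated against the current period is the sketch below. It is an assumption about what covers(when) might do, not the shipped code: the function name covers_sketch is hypothetical, the window is assumed to be shorter than its recurrence period, and recurrences are assumed never to apply before the original start_time.

```python
from datetime import datetime, timedelta


def _monthly_candidates(start, when):
    """Project start onto when's month and the month before it."""
    for offset in (0, -1):
        month_index = when.year * 12 + (when.month - 1) + offset
        year, month0 = divmod(month_index, 12)
        try:
            candidate = start.replace(year=year, month=month0 + 1)
        except ValueError:
            continue  # e.g. day 31 does not exist in a 30-day month
        yield candidate


def covers_sketch(start, end, recurrence, when):
    """Return True if `when` falls inside the window or its current projection."""
    if when < start:
        return False  # assumption: no occurrences before the original start
    if recurrence == "none":
        return start <= when <= end
    duration = end - start
    if recurrence == "weekly":
        week = timedelta(weeks=1)
        elapsed_weeks = (when - start) // week  # whole weeks since first start
        projected = start + elapsed_weeks * week
        return projected <= when <= projected + duration
    if recurrence == "monthly":
        return any(p <= when <= p + duration for p in _monthly_candidates(start, when))
    raise ValueError(f"unknown recurrence: {recurrence!r}")
```

The previous month is checked as well so that a monthly window that starts late in one month and ends early in the next is still recognised.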

FSM enforcement

Two transitions on ChangeRequest call assert_not_frozen from apps.changes.services:

  • submit() — Only checks when planned_start and planned_end are set. Engineers can still draft titles and descriptions during a freeze; they just can't lock a window that lands inside one.
  • schedule() — Always checks. The change has been through review and now needs a concrete slot, so we always run the guard.

When a freeze is hit, the FSM raises a DRF ValidationError with code freeze_window_blocked and a blockers array listing each overlapping window's id, title, start_time, end_time, and allow_override flag.
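The guard and the shape of its error can be sketched as follows. To keep the snippet runnable without Django or DRF, a plain exception (the hypothetical FreezeBlockedSketch) stands in for the DRF ValidationError, freezes are plain dicts, and the function names with a _sketch suffix are illustrative rather than the real signatures in apps.changes.services.

```python
from datetime import datetime

BLOCKER_FIELDS = ("id", "title", "start_time", "end_time", "allow_override")


class FreezeBlockedSketch(Exception):
    """Stand-in for the ValidationError raised with code freeze_window_blocked."""
    code = "freeze_window_blocked"

    def __init__(self, blockers):
        super().__init__(self.code)
        self.blockers = blockers


def assert_not_frozen_sketch(freezes, planned_start, planned_end):
    """Raise if any active freeze overlaps the planned window."""
    blockers = [
        {name: freeze[name] for name in BLOCKER_FIELDS}
        for freeze in freezes
        if freeze["is_active"]
        and freeze["start_time"] <= planned_end
        and planned_start <= freeze["end_time"]
    ]
    if blockers:
        raise FreezeBlockedSketch(blockers)


def submit_sketch(freezes, planned_start=None, planned_end=None):
    # submit() only runs the guard once both ends of the window are set,
    # so drafts without a planned window pass through during a freeze.
    if planned_start is not None and planned_end is not None:
        assert_not_frozen_sketch(freezes, planned_start, planned_end)
    return "submitted"


def schedule_sketch(freezes, planned_start, planned_end):
    # schedule() always runs the guard.
    assert_not_frozen_sketch(freezes, planned_start, planned_end)
    return "scheduled"
```

A client handling the error would typically read the blockers list to decide whether an override is even possible, since any entry with allow_override set to False rules it out.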

Common workflows

Declaring a freeze window

  1. Go to Settings → Maintenance → Freeze windows → New.
  2. Fill in title, reason, start time, end time. Pick scope: leave applies_to_priorities and applies_to_change_types blank for a total freeze, or narrow it.
  3. Choose allow_override. The default True lets admins override with justification; flip to False for a hard lockout.
  4. Save. Active freezes show as a red banner on the changes list during their interval and a pink overlay on the change calendar.

Overriding a freeze

  1. From a blocked change, click Override freeze. The action is only visible to users with the freeze_windows.override permission (admin / owner).
  2. The override dialog requires a justification of at least 20 characters. This text is recorded in the audit log alongside the user, timestamp, and the IDs of the freezes overridden.
  3. The change's submit() or schedule() is re-run with override=True. If every overlapping freeze has allow_override=True, the transition completes. If any freeze has allow_override=False, the override is refused regardless of caller permission.
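The three checks in this workflow (permission, justification length, and every blocker allowing override) can be sketched as one decision function. The name can_override_sketch, the returned tuple, and the order of the checks are assumptions made for illustration; whether whitespace is trimmed before the length check is also a guess.

```python
MIN_JUSTIFICATION_LENGTH = 20  # from the override dialog rule above


def can_override_sketch(blockers, has_override_permission, justification):
    """Return (allowed, reason) for an override attempt."""
    if not has_override_permission:
        return False, "missing freeze_windows.override permission"
    # Assumption: surrounding whitespace does not count toward the minimum.
    if len(justification.strip()) < MIN_JUSTIFICATION_LENGTH:
        return False, "justification too short"
    # A single hard freeze refuses the override, whoever the caller is.
    if any(not blocker["allow_override"] for blocker in blockers):
        return False, "hard freeze"
    return True, ""
```

The point of the last check is that allow_override=False takes precedence over caller permission, which is what makes a hard lockout hard.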

Declaring a maintenance window

  1. Go to Settings → Maintenance → Maintenance windows → New.
  2. Pick start, end, and recurrence. For a one-off, leave recurrence as none.
  3. Add affected_hosts from the asset picker. Leave empty to apply to all hosts.
  4. Choose suppress_ssl_alerts (default on) and auto_approve_standard (default off).
  5. Save. The window appears on the monitoring dashboard with a muted-alerts badge.

Permissions

  • Viewers can see active freezes and maintenance windows but cannot edit.
  • Engineers can read the policy and submit changes that respect it.
  • Approvers / admins / owners can create, edit, and delete windows.
  • Only admins and owners can override a freeze, and only when the freeze allows it.

Troubleshooting

freeze_window_blocked on submit — Your planned window overlaps an active freeze. The error payload lists each blocker. Either move the change window outside the freeze, or ask an admin to override (only possible if every blocker has allow_override=True).

Override refused with "hard freeze" — One of the overlapping freezes has allow_override=False. There is no recourse short of an admin deactivating the freeze record, which is itself an audit-worthy action.

Standard changes still need approval inside maintenance — auto_approve_standard only applies to changes flagged is_standard=True whose entire planned window falls inside the maintenance window. Partial overlaps don't qualify.
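The containment rule is easy to get wrong when a change merely overlaps the maintenance window. A minimal sketch of the check, assuming closed intervals and a hypothetical helper name:

```python
from datetime import datetime


def qualifies_for_auto_approve_sketch(
    is_standard, planned_start, planned_end,
    window_start, window_end, auto_approve_standard=True,
):
    """True only when a standard change sits entirely inside the window."""
    if not (auto_approve_standard and is_standard):
        return False
    # Full containment is required; a partial overlap does not qualify.
    return window_start <= planned_start and planned_end <= window_end
```

A change planned for 01:00 to 03:00 against a window of 02:00 to 06:00 overlaps it but starts too early, so it would still go through normal approval.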

SSL alerts firing during planned maintenance — Confirm that the host UUID is in affected_hosts (or that affected_hosts is empty, which means all hosts), that the window has is_active=True and suppress_ssl_alerts=True, and that the current time falls inside [start_time, end_time]. For recurring windows, the check uses the window projected onto the current period, not the literal original start_time.
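The checklist above can be walked through in code when debugging a window. This sketch covers only non-recurring windows and represents the window as a plain dict; the function name is hypothetical and the real suppression path may evaluate these conditions differently.

```python
from datetime import datetime


def should_suppress_ssl_alert_sketch(window, host_id, now):
    """Return True if an SSL alert for host_id at `now` would be dropped."""
    if not window["is_active"]:
        return False
    if not window["suppress_ssl_alerts"]:
        return False
    hosts = window["affected_hosts"]
    # An empty affected_hosts list means every host in the workspace.
    if hosts and host_id not in hosts:
        return False
    return window["start_time"] <= now <= window["end_time"]
```

Running each condition separately against the stored window usually shows which one is false, most often a host missing from affected_hosts.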
