Reliability

Public status page

The status page is Vigilo's customer-facing surface for service health. It is intentionally separate from the internal incident system: what gets published to the world is a curated subset of internal data, sanitised to avoid leaking sensitive details like raw IPs or internal hostnames.

Overview

A StatusPage row carries the public branding (name, logo, custom domain), one or more StatusPageComponent rows (the user-facing services, such as Web app, API, and Auth), StatusPageIncident rows (announcements that move through the states investigating → identified → monitoring → resolved), scheduled maintenance windows, and email subscribers.

A component can be linked to an Asset, in which case the status badge auto-tracks the asset's health. Unlinked components are entirely manual. Incidents can be created manually or auto-published from internal incidents when their severity meets or exceeds the workspace threshold (T1.8).
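
The relationship between a page and its components can be pictured with a small data model. This is a minimal sketch, not Vigilo's actual schema: the class and field names follow the fields listed under Key concepts, while the default values, the string type used for linked_asset, and the helper methods are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Status values listed for StatusPageComponent in this article.
COMPONENT_STATUSES = (
    "operational", "degraded", "partial_outage", "major_outage", "maintenance",
)


@dataclass
class StatusPageComponent:
    name: str
    status: str = "operational"         # assumed default
    is_public: bool = True              # assumed default
    linked_asset: Optional[str] = None  # hypothetical: an Asset identifier

    def __post_init__(self) -> None:
        if self.status not in COMPONENT_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    def is_manual(self) -> bool:
        """Unlinked components are managed entirely by hand."""
        return self.linked_asset is None


@dataclass
class StatusPage:
    name: str
    slug: str
    is_public: bool = False
    components: list[StatusPageComponent] = field(default_factory=list)

    def public_components(self) -> list[StatusPageComponent]:
        """Components an anonymous viewer could see (none if the page is private)."""
        if not self.is_public:
            return []
        return [c for c in self.components if c.is_public]
```

In this sketch a component is hidden from the public either because the whole page is private or because its own is_public flag is off, which mirrors the two visibility switches described in this article.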

Why it exists

Customers want to know whether the service they pay for is broken without raising a ticket. A clean status page reduces inbound support load by an order of magnitude during an outage, and a believable history (a page with too few historical incidents looks dishonest) builds trust. Vigilo aims to make the page so easy to keep current that there is no excuse for letting it lie.

Key concepts

  • StatusPage — one per workspace by default; you can also create per-product pages for multi-product accounts. Fields: name, slug (subdomain), custom_domain, is_public, theme, support_url, default_components_layout.
  • StatusPageComponent — name, description, position, status (operational, degraded, partial_outage, major_outage, maintenance), is_public, linked_asset (optional).
  • StatusPageIncident — title, body, current state, severity, list of impacted components, timestamps for each state transition, optional linkage to an internal Incident.
  • Auto-publish (T1.8) — when an internal Incident meets or exceeds Workspace.settings.status_page_auto_publish_severity and Incident.auto_publish_to_status is true, the dispatcher creates a sanitised StatusPageIncident, mapping internal severity to public severity and removing private fields.
  • Sanitiser — a small filter function that strips raw IPv4/IPv6 addresses, internal FQDNs ending in a suffix from your workspace-configured private suffix list, and any text matching the workspace's secret-pattern regex. The original incident body remains untouched internally.
  • Embed widget (WD.10) — widget.html is a self-contained snippet you can embed in marketing pages. It pulls component status as JSON and renders a single coloured bar.
  • RSS feed — every page exposes /feed.rss and /feed.atom so users can subscribe in their feed reader of choice.
  • Scheduled maintenance — define start and end timestamps; Vigilo auto-posts an announcement at the configured lead time (default 24 hours) and posts a closing note when the window ends.
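
To make the sanitiser's three rules concrete, here is a rough sketch of how such a filter could work. It is not Vigilo's implementation: the function name sanitise, the "[redacted]" placeholder, and the candidate regexes are assumptions. IP candidates are validated with Python's standard ipaddress module, so strings that merely look like addresses (a time such as 12:30, for instance) are left alone.

```python
import ipaddress
import re
from typing import Iterable, Optional, Pattern

REDACTED = "[redacted]"  # hypothetical placeholder text

# Loose candidate patterns; each IP candidate is validated before redaction.
_IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
_IPV6_CANDIDATE = re.compile(r"(?<![\w:])[0-9A-Fa-f:]+(?![\w:])")
_HOST_CANDIDATE = re.compile(r"\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+\b")


def _is_ip(token: str) -> bool:
    try:
        ipaddress.ip_address(token)
    except ValueError:
        return False
    return True


def sanitise(
    text: str,
    private_suffixes: Iterable[str],
    secret_pattern: Optional[Pattern[str]] = None,
) -> str:
    """Return a copy of text with private details replaced; the input is not modified."""
    suffixes = tuple(s.lower() for s in private_suffixes)

    def redact_ip(match: re.Match) -> str:
        token = match.group(0)
        return REDACTED if _is_ip(token) else token

    def redact_host(match: re.Match) -> str:
        host = match.group(0)
        return REDACTED if host.lower().endswith(suffixes) else host

    text = _HOST_CANDIDATE.sub(redact_host, text)  # internal FQDNs
    text = _IPV4_CANDIDATE.sub(redact_ip, text)    # raw IPv4 addresses
    text = _IPV6_CANDIDATE.sub(redact_ip, text)    # raw IPv6 addresses
    if secret_pattern is not None:
        text = secret_pattern.sub(REDACTED, text)  # workspace secret regex
    return text
```

A call such as sanitise(body, [".corp.example.com"], re.compile(r"sk_live_\w+")) returns a new string, which matches the point above that the original incident body stays untouched. The sketch errs on the side of over-redaction: a dotted version number like 1.2.3.4 would also be removed.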

Common workflows

1. Stand up your first status page

  1. Open Reliability → Status page.
  2. Click Create page. Fill in Name and Slug (e.g. acme becomes acme.status.vigilo.com).
  3. Optional: add a Custom domain (CNAME instructions are shown inline).
  4. Add components: click + Component, name it (Web app), set initial status (operational), optionally link to an existing Asset.
  5. Toggle Is public when you are ready. Until then the page returns 404.

2. Post an incident manually

  1. From the status page admin, click + New incident.
  2. Enter a title, body, list of impacted components, and severity.
  3. Save with state = investigating. Subscribers receive the initial email; the page shows the incident at the top of the history.
  4. As the situation evolves, click + Update and pick the new state (identified, monitoring, resolved). Each update sends one email per subscriber and adds one entry to the RSS feed.
  5. The component status returns to operational only when the incident is moved to resolved and you confirm the prompt.
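
The state progression in these steps can be sketched as a tiny state machine. The state names come from this article; the rule that updates may only move forward (skipping states is allowed, going back is not), the class name IncidentTimeline, and its methods are assumptions for illustration rather than documented Vigilo behaviour.

```python
from datetime import datetime, timezone
from typing import Optional

STATES = ("investigating", "identified", "monitoring", "resolved")


class InvalidTransition(ValueError):
    """Raised when an update does not move the incident forward."""


class IncidentTimeline:
    def __init__(self, now: Optional[datetime] = None) -> None:
        self.state = STATES[0]
        # One timestamp per state the incident has entered.
        self.transitions = {self.state: now or datetime.now(timezone.utc)}

    def update(self, new_state: str, now: Optional[datetime] = None) -> None:
        if new_state not in STATES:
            raise InvalidTransition(f"unknown state: {new_state!r}")
        # Assumption: only forward moves are accepted.
        if STATES.index(new_state) <= STATES.index(self.state):
            raise InvalidTransition(f"cannot move from {self.state} to {new_state}")
        self.state = new_state
        self.transitions[new_state] = now or datetime.now(timezone.utc)

    def may_restore_components(self, confirmed: bool) -> bool:
        """Components go back to operational only after resolution and confirmation."""
        return self.state == "resolved" and confirmed
```

Recording a timestamp per transition mirrors the "timestamps for each state transition" field on StatusPageIncident.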

3. Enable auto-publish from internal incidents

  1. Open Settings → Status page → Auto-publish.
  2. Set Severity threshold to e.g. high. Internal incidents at high or critical will auto-publish; low and medium will not.
  3. Per-incident override: on the incident editor, toggle Publish to status page off if you want to keep a specific high-severity incident private.
  4. When an internal incident triggers auto-publish, the sanitiser is applied and a draft StatusPageIncident is created in state investigating. Admins receive a notification to review and approve, or you can configure auto-approve in the same settings page.
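
The threshold and the per-incident override combine into a single yes/no decision, which can be sketched as follows. The severity names and their order come from step 2; the function name should_auto_publish and its signature are hypothetical.

```python
# Severities in ascending order, as used in the threshold example above.
SEVERITY_ORDER = ("low", "medium", "high", "critical")


def should_auto_publish(
    incident_severity: str,
    threshold: str,
    auto_publish_to_status: bool = True,
) -> bool:
    """True when the incident is at or above the threshold and not overridden."""
    rank = SEVERITY_ORDER.index  # raises ValueError for an unknown severity
    return auto_publish_to_status and rank(incident_severity) >= rank(threshold)
```

With the threshold set to high, should_auto_publish("critical", "high") is true, should_auto_publish("medium", "high") is false, and switching the per-incident toggle off makes the result false regardless of severity.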

4. Schedule a maintenance window

  1. From the page admin, click + Maintenance.
  2. Enter a title, start time, end time, impacted components, and the lead time for the heads-up announcement (default 24 h).
  3. Save. Vigilo posts an investigating-style announcement at the lead time, switches the components to maintenance at start time, and posts a resolved note at the end time.
  4. Cancelling a window before its start removes all announcements and restores component status.
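
The timing in step 3 is simple date arithmetic: the announcement goes out at the start time minus the lead time. A minimal sketch, in which the function name maintenance_events and the event labels are made up for illustration:

```python
from datetime import datetime, timedelta, timezone


def maintenance_events(
    start: datetime,
    end: datetime,
    lead_time: timedelta = timedelta(hours=24),  # default lead time from this article
) -> list[tuple[datetime, str]]:
    """Return the (time, action) pairs for one maintenance window, in order."""
    if end <= start:
        raise ValueError("end must be after start")
    return [
        (start - lead_time, "post_announcement"),
        (start, "set_components_to_maintenance"),
        (end, "post_resolved_note"),
    ]
```

For a window from 02:00 to 04:00 UTC on 10 January, the default lead time puts the announcement at 02:00 UTC on 9 January.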

5. Embed the widget on your marketing site

  1. Open Status page → Embed.
  2. Copy the widget.html snippet (5 lines of HTML referencing a CDN-hosted script).
  3. Paste into your site. The widget polls the public JSON endpoint every 60 seconds and renders a coloured strip with hover tooltip.
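
The article does not specify the shape of the public JSON or how the widget picks the colour of its strip, so the following is only an illustration of one plausible approach: show the worst status among the components. The payload layout, the severity ranking, the colour names, and the function names are all assumptions.

```python
import json

# Assumed ranking, from least to most severe.
STATUS_RANK = {
    "operational": 0,
    "maintenance": 1,
    "degraded": 2,
    "partial_outage": 3,
    "major_outage": 4,
}

# Hypothetical colour mapping for the strip.
STATUS_COLOUR = {
    "operational": "green",
    "maintenance": "blue",
    "degraded": "yellow",
    "partial_outage": "orange",
    "major_outage": "red",
}


def overall_status(payload: str) -> str:
    """Return the most severe component status found in the JSON payload."""
    components = json.loads(payload)["components"]
    return max(
        (c["status"] for c in components),
        key=STATUS_RANK.__getitem__,
        default="operational",  # no components: treat the page as healthy
    )


def strip_colour(payload: str) -> str:
    return STATUS_COLOUR[overall_status(payload)]
```

A site that wants its own rendering instead of the stock widget could apply the same roll-up to whatever the public endpoint returns, once the real payload format has been checked.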

Permissions

Action                             Roles
View admin                         All workspace members
Create or edit page                Admin, Owner
Create or update incident          Operator, Admin, Owner
Approve auto-published incident    Admin, Owner
Manage subscribers                 Admin, Owner
Schedule maintenance               Operator, Admin, Owner
Configure custom domain            Owner

Public viewers do not log in. They consume status.example.com, /feed.rss and /widget.html anonymously. No workspace data leaks across the page boundary thanks to the sanitiser and the explicit is_public flag on each component.
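
For teams scripting against the admin side, the permissions table can be expressed as a lookup. This is a sketch only: the action keys and the is_allowed helper are hypothetical names, and the role names are taken from the table.

```python
ALL_MEMBERS = None  # sentinel: any workspace member may perform the action

PERMISSIONS = {
    "view_admin": ALL_MEMBERS,
    "create_or_edit_page": {"Admin", "Owner"},
    "create_or_update_incident": {"Operator", "Admin", "Owner"},
    "approve_auto_published_incident": {"Admin", "Owner"},
    "manage_subscribers": {"Admin", "Owner"},
    "schedule_maintenance": {"Operator", "Admin", "Owner"},
    "configure_custom_domain": {"Owner"},
}


def is_allowed(role: str, action: str) -> bool:
    """Check a workspace member's role against the permissions table."""
    allowed = PERMISSIONS[action]  # KeyError for an action not in the table
    return allowed is ALL_MEMBERS or role in allowed
```

For example, is_allowed("Operator", "schedule_maintenance") is true, while is_allowed("Admin", "configure_custom_domain") is false because only the Owner may configure the custom domain.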

Troubleshooting

Custom domain CNAME is set but the page shows a TLS error. Vigilo issues a certificate for the custom domain via Let's Encrypt the first time a request arrives. Issuance can take up to 90 seconds after the CNAME resolves, so wait about two minutes and then hard-refresh. If the error persists, check that no stale A or AAAA record exists for the same hostname; DNS does not allow a CNAME to coexist with other record types at the same name.

Auto-published incident contains an internal hostname. The sanitiser's list of private suffixes is workspace-configurable. Open Settings → Status page → Sanitiser and add the suffix (e.g. .corp.example.com). The new suffix applies to future incidents only; the active incident is not re-sanitised, so edit its body manually.

Subscribers complain they did not receive an email. Open the incident's Delivery tab. Each subscriber row shows status: queued, delivered, soft-bounced, hard-bounced, unsubscribed. Hard-bounces remove the address from the list automatically.

Component status flaps between operational and degraded. The linked Asset is probably crossing a probe threshold repeatedly. Either widen the asset's health hysteresis or unlink the component and manage its status manually during the flaky period.
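
To see why hysteresis stops flapping, consider a filter that changes the published status only after the same new reading has been seen several times in a row. How Vigilo implements asset health hysteresis is not described here; the class name, the required_consecutive parameter, and its default of 3 are assumptions chosen for illustration.

```python
class HysteresisStatus:
    """Publish a new status only after N consecutive identical probe results."""

    def __init__(self, required_consecutive: int = 3, initial: str = "operational") -> None:
        self.required = required_consecutive
        self.status = initial
        self._candidate = None  # status currently being considered
        self._count = 0         # consecutive probes that reported the candidate

    def observe(self, probe_status: str) -> str:
        if probe_status == self.status:
            # Reading agrees with the published status: drop any pending change.
            self._candidate, self._count = None, 0
        elif probe_status == self._candidate:
            self._count += 1
        else:
            self._candidate, self._count = probe_status, 1

        if self._candidate is not None and self._count >= self.required:
            self.status = self._candidate
            self._candidate, self._count = None, 0
        return self.status
```

With the default of three, an alternating sequence of degraded and operational readings never changes the published status, while three degraded readings in a row do. Widening the hysteresis corresponds to raising required_consecutive in this sketch.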

RSS feed returns 404. The page has is_public set to false. Toggle Is public on; the feed URLs only resolve for public pages.

Related articles

  • Service Level Objectives — long-term reliability tracking; complements the per-incident status page.
  • Alert rules — wire internal alerts to status-page incidents through auto-publish.
  • Host monitoring — the probe data behind linked component health.