Synthetic monitoring

The Synthetic Checks page (Reliability → Synthetic checks) is where you declare every external probe Vigilo should run on a schedule. It covers five check types: HTTP/HTTPS request, TCP port, DNS lookup, ICMP ping, and Heartbeat (dead-man's-switch).

Heads-up on the old "Monitoring" module. Vigilo used to ship a separate "Monitoring" surface for declaring host registrations and capturing per-host TLS snapshots. That has been consolidated into Certificates, which now owns "what hosts do we scan, and what cert did we see." Synthetic Checks (this page) is the active probe-scheduling surface; Certificates is the cert inventory + scan surface. The legacy /ws/<slug>/monitoring URL redirects to /certificates so old bookmarks keep working.

Overview

A SyntheticCheck represents one probe that Celery executes on a configurable interval. Each check produces a stream of SyntheticResult rows tagged with status (ok / fail), response time, and any error message. Per-check stats (uptime, p50 / p95 / p99 latency, sample count) are computed on demand for the last 7 days.
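To make the stats concrete, here is a minimal sketch of how uptime and latency percentiles could be derived from a list of results. The `Result` class and its field names are hypothetical stand-ins for SyntheticResult rows, and the nearest-rank percentile method is an assumption; Vigilo may compute percentiles differently.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Result:
    """Hypothetical stand-in for one SyntheticResult row."""
    at: datetime
    status: str                    # "ok" or "fail"
    response_ms: Optional[float]   # None if the probe failed before timing


def percentile(sorted_values: List[float], pct: float) -> Optional[float]:
    """Nearest-rank percentile (an assumption, not Vigilo's confirmed method)."""
    if not sorted_values:
        return None
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def seven_day_stats(results: List[Result], now: datetime) -> dict:
    """Summarise only the results from the 7 days before `now`."""
    cutoff = now - timedelta(days=7)
    window = [r for r in results if r.at >= cutoff]
    latencies = sorted(r.response_ms for r in window if r.response_ms is not None)
    ok_count = sum(1 for r in window if r.status == "ok")
    return {
        "samples": len(window),
        "uptime_pct": 100.0 * ok_count / len(window) if window else None,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

Results with no measured latency still count toward samples and uptime, but are left out of the percentile calculation.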

The page lays the checks out as a card grid with status-derived dots (green = active, amber = snoozed, red = heartbeat overdue, grey = disabled). The header KPI strip summarises Total / Active / Snoozed / Overdue heartbeats so the operator sees the workspace state without scanning rows.
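The dot colours can be read as a small decision function. The sketch below is illustrative only: the function names are hypothetical, and the precedence (disabled first, then overdue, then snoozed) is an assumption, since the page does not say which colour wins when several states apply.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_overdue(last_ping: datetime, expected_interval: timedelta,
               now: datetime) -> bool:
    """A heartbeat is overdue once the gap since the last ping exceeds the interval."""
    return now - last_ping > expected_interval


def status_dot(active: bool, snoozed_until: Optional[datetime],
               heartbeat_overdue: bool, now: datetime) -> str:
    """Map check state to a dot colour. The precedence order is an assumption."""
    if not active:
        return "grey"    # disabled
    if heartbeat_overdue:
        return "red"     # heartbeat overdue
    if snoozed_until is not None and snoozed_until > now:
        return "amber"   # snoozed, and the snooze has not expired yet
    return "green"       # active
```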

Check types

HTTPS request — full TLS handshake plus a GET on the URL; expected status (default 200) and optional response-body substring match.
HTTP request — same as HTTPS, plaintext.
TCP port — raw socket connect with a timeout. Useful for SMTP / SSH / database ports where an HTTP probe wouldn't apply.
DNS lookup — resolves a hostname against the workspace's configured resolver and asserts that a record of the expected type is returned.
ICMP ping — ping -c1 with a configurable timeout. Useful for layer-3 reachability checks of internal hosts that don't expose ports to the scanner.
Heartbeat (dead-man's-switch) — Vigilo issues a unique URL; your cron job / batch worker POSTs to it on every run. When the gap between pings exceeds the expected interval, Vigilo flips the check to "overdue" and fires the configured alerts. Cronitor / Healthchecks.io style.
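As a rough illustration of what two of these probes do, the sketch below evaluates an HTTP response against an expected status and optional body substring, and performs a TCP connect with a timeout. The function names and the result dictionary shape are hypothetical, not Vigilo's internal API.

```python
import socket
import time
from typing import Optional, Tuple


def evaluate_http(status_code: int, body: str, expected_status: int = 200,
                  body_substring: Optional[str] = None) -> Tuple[bool, str]:
    """Return (ok, error_message) for an HTTP/HTTPS response already fetched."""
    if status_code != expected_status:
        return False, f"expected status {expected_status}, got {status_code}"
    if body_substring is not None and body_substring not in body:
        return False, f"body does not contain {body_substring!r}"
    return True, ""


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> dict:
    """Open and immediately close a TCP connection, timing the connect."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass  # a successful connect is all this probe asserts
    except OSError as exc:  # covers refused connections and timeouts
        return {"status": "fail", "response_ms": None, "error": str(exc)}
    elapsed_ms = (time.monotonic() - started) * 1000
    return {"status": "ok", "response_ms": elapsed_ms, "error": ""}
```

A TCP probe like this only proves the port accepts connections; it says nothing about whether the service behind it answers correctly.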

Common workflows

1. Create a check

Click the + button in the page header.
Pick a check type. The form re-renders to show only the fields that type needs (URL for HTTP / HTTPS; hostname + port for TCP; etc.).
Set the check interval. Most types default to 5 minutes; heartbeats default to "expect a ping every 15 minutes."
Save. The check is first executed on the next Celery tick.
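The per-type form fields and default intervals from the steps above can be sketched as a small lookup. The type keys and field names here are hypothetical; the DNS and ICMP field sets in particular are assumptions, since this page only spells out the fields for HTTP / HTTPS and TCP.

```python
from datetime import timedelta
from typing import Dict, List, Set

# Fields each check type needs (keys and field names are illustrative).
REQUIRED_FIELDS: Dict[str, Set[str]] = {
    "https": {"url"},
    "http": {"url"},
    "tcp": {"hostname", "port"},
    "dns": {"hostname", "record_type"},  # assumption
    "icmp": {"hostname"},                # assumption
    "heartbeat": set(),                  # Vigilo issues the ping URL itself
}


def default_interval(check_type: str) -> timedelta:
    """5 minutes for most types; heartbeats expect a ping every 15 minutes."""
    return timedelta(minutes=15 if check_type == "heartbeat" else 5)


def missing_fields(check_type: str, fields: dict) -> List[str]:
    """List the required fields that the submitted form data lacks."""
    return sorted(REQUIRED_FIELDS[check_type] - set(fields))
```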

2. View per-check stats + recent results

Click any card. The side panel shows:

7-day stats grid: samples, uptime %, p50 / p95 / p99 latency.
The last 20 results with status, response time, and error message.
For heartbeats: the ping URL with a copy-snippet helper.
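If your job is written in Python, a heartbeat ping might look like the sketch below. The `PING_URL` value is a placeholder, not a real Vigilo URL format; paste the actual URL from the detail panel. The helper names are hypothetical.

```python
import urllib.request

# Placeholder only: copy the real ping URL from the check's detail panel.
PING_URL = "https://vigilo.example.com/REPLACE-WITH-YOUR-PING-URL"


def build_ping_request(url: str) -> urllib.request.Request:
    """Build an empty-bodied POST request to the heartbeat URL."""
    return urllib.request.Request(url, data=b"", method="POST")


def send_ping(url: str, timeout: float = 10.0) -> bool:
    """Send the ping; return False on network or HTTP errors instead of raising."""
    try:
        with urllib.request.urlopen(build_ping_request(url), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False


# Typical use at the end of a batch job, after the real work has succeeded:
#     send_ping(PING_URL)
```

Pinging only after the job succeeds means a crashed run shows up as an overdue heartbeat rather than a false "all good".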

3. Run a check on demand

Run now, available on each card and in the detail panel, triggers an out-of-band probe and refreshes the result list. Useful right after a deploy.

4. Snooze a check

Snooze 1h suppresses alerts without disabling scheduling — the probes keep running so the result history stays continuous, but rules don't fire. Snooze auto-expires; no manual unsnooze needed.
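The snooze behaviour can be pictured as a gate on alerting rather than on scheduling. This is a minimal sketch under that reading; the function names and the `snoozed_until` timestamp are hypothetical, not Vigilo's actual data model.

```python
from datetime import datetime, timedelta
from typing import Optional


def snooze_until(now: datetime, duration: timedelta = timedelta(hours=1)) -> datetime:
    """Timestamp at which a snooze started at `now` expires on its own."""
    return now + duration


def should_alert(result_status: str, snoozed_until: Optional[datetime],
                 now: datetime) -> bool:
    """The probe result is always recorded; only the alert is suppressed."""
    if result_status != "fail":
        return False
    snoozed = snoozed_until is not None and now < snoozed_until
    return not snoozed
```

Because expiry is just a timestamp comparison, nothing has to run to "unsnooze" a check.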

5. Edit a check

The pencil icon opens the edit dialog. You can rename the check, change its target (URL or host/port), adjust the interval, and toggle whether it is active. The check type stays read-only after creation; recreate the check if you need a different probe shape.

6. Delete a check

The trash icon asks for confirmation, then removes the check and its full result history. The action is irreversible.

Permissions

Action	Required permission
View checks + stats	`monitoring.view`
Create / edit / delete	`monitoring.edit`
Run on demand, snooze	`monitoring.edit`
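The table can be expressed as a lookup, for example when scripting access reviews. The action keys below are hypothetical labels; only the two permission strings come from the table. The sketch assumes `monitoring.edit` does not implicitly grant `monitoring.view`, which this page does not confirm either way.

```python
from typing import Dict, Set

# Action labels are illustrative; permission strings are from the table above.
REQUIRED_PERMISSION: Dict[str, str] = {
    "view": "monitoring.view",
    "create": "monitoring.edit",
    "edit": "monitoring.edit",
    "delete": "monitoring.edit",
    "run_now": "monitoring.edit",
    "snooze": "monitoring.edit",
}


def can(user_permissions: Set[str], action: str) -> bool:
    """True if the user holds the permission the action requires."""
    return REQUIRED_PERMISSION[action] in user_permissions
```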

Troubleshooting

"The card shows latency … and never resolves." — The check has no results yet (newly created, or every probe so far has failed before measuring latency). Click Run now to force one immediate execution and inspect the result message.

"Heartbeat says overdue but my cron is running." — Check the ping URL the workspace expects against what your cron actually hits. Workspace migrations or domain renames change the URL prefix; copy the current value from the detail panel.

"Uptime % drops sharply at midnight every day." — Likely the workspace's resolver or the target host briefly fails during scheduled maintenance. Combine a maintenance window with a snooze so the overnight failures don't pollute SLO calculations.
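To confirm that failures really cluster around one time of day, you can bucket failed results by hour. This sketch assumes you have the results available as (timestamp, status) pairs, for example copied from the detail panel; how you obtain them is outside what this page covers, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def failures_by_hour(results: Iterable[Tuple[datetime, str]]) -> Counter:
    """Count failed results per hour of day (0-23)."""
    return Counter(ts.hour for ts, status in results if status == "fail")
```

A count concentrated in hour 0 supports the maintenance-window explanation; failures spread across the day point elsewhere.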

Related pages

Certificates — cert inventory + TLS scan + expiry alerts (what used to live under "Monitoring")
Alert rules — wire synthetic failures to email / webhook / Slack
SLOs — turn long-running synthetic uptime into error-budget burn alerts
Assets — link a synthetic check to the asset whose health it represents