Reliability

Alert rules

Alert rules turn raw scan results into notifications. They are the single configuration surface that decides who hears about an expiring certificate, a failed scan, or an issuer change — and how often.

Where to find them. Alert rules no longer have a dedicated sidebar entry. Each module owns its own rules now — open Certificates → Bell icon in the page header to manage certificate-scoped rules, or Monitoring → Bell icon for host/uptime-scoped rules. Both icons deep-link to the same Alert Rules page with a pre-filtered scope.

Overview

An AlertRule is a workspace-scoped row that links a trigger (a condition observable on monitoring data) to one or more channels (email recipients and webhook endpoints), constrained by an alert scope (the whole workspace or a specific list of hosts) and a cooldown (a minimum interval between fires). Each rule also carries an owner module (certificate or monitoring) that determines which page's Bell icon surfaces it.

Rules are evaluated by the FastAPI dispatcher (alert_dispatch.py) at the end of every probe, every renewal run, and every periodic sweep. The dispatcher walks active rules for the workspace, asks each one whether it matches the latest snapshot, and — if it does — calls the channel layer to send the message.
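The evaluation loop described above can be sketched roughly as follows. This is an illustration only, not the contents of alert_dispatch.py: the Snapshot and Rule classes, the matches callable, and the send callback are all hypothetical stand-ins, and cooldown handling is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Snapshot:
    """Latest monitoring observation for one host (hypothetical shape)."""
    host_id: str
    days_until_expiry: int


@dataclass
class Rule:
    """Trimmed-down stand-in for an AlertRule row (hypothetical)."""
    name: str
    is_active: bool
    matches: Callable[[Snapshot], bool]  # the trigger condition
    recipients: list[str] = field(default_factory=list)


def dispatch(rules: list[Rule], snapshot: Snapshot,
             send: Callable[[str, str], None]) -> int:
    """Walk active rules, notify for each match, and return the number of fires."""
    fired = 0
    for rule in rules:
        if not rule.is_active:
            continue  # inactive rules are skipped entirely
        if rule.matches(snapshot):
            for recipient in rule.recipients:
                send(recipient, f"{rule.name}: host {snapshot.host_id}")
            fired += 1
    return fired


sent: list[tuple[str, str]] = []
rules = [
    Rule("expiring", True, lambda s: s.days_until_expiry <= 30, ["ops@example.com"]),
    Rule("expired", True, lambda s: s.days_until_expiry < 0, ["oncall@example.com"]),
    Rule("muted", False, lambda s: True, ["nobody@example.com"]),
]
fired_count = dispatch(rules, Snapshot("host-1", 12),
                       lambda to, msg: sent.append((to, msg)))
```

With a snapshot 12 days from expiry, only the first rule fires: the second does not match and the third is inactive.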

Why it exists

Hard-coded notifications never fit every team. Some operators want a single weekly digest, others want a Slack ping at 30, 14 and 1 days before expiry plus a PagerDuty call for expired. Alert rules give you a small, composable building block so the on-call rotation can tune signal/noise without involving engineering.

Key concepts

  • AlertRule fields: name, trigger_on, days_threshold (only used by certain triggers), scope (the owner module, certificate or monitoring, which determines which Bell icon surfaces the rule), alert_scope (target breadth: workspace or host), scope_hosts[] (host UUIDs, used when alert_scope is host), recipients[] (email addresses and webhook IDs), cooldown_minutes (default 1440), last_fired_at (auto-managed), is_active.
  • Triggers — Vigilo ships these built-in triggers:
    • expiring_soon — fires when days_until_expiry <= days_threshold for any in-scope host.
    • expired — fires when days_until_expiry < 0 (a fresh fire after each renewal attempt).
    • scan_failed — fires when a probe raises an error two consecutive times.
    • renewal_failed — fires when a CertificateRenewalPolicy run ends in failed or needs_manual.
    • cert_issuer_changed — fires when consecutive snapshots show a different issuer DN (WA.15).
    • cert_anomaly — fires when WB.27 detects a change in protocol, key, cipher or SAN.
  • Channels — email is the default. Webhook channels reuse the Webhook rows from the Integrations app, so an existing Slack or PagerDuty hook can be selected from a dropdown rather than re-entered.
  • Cooldown semantics — the dispatcher records last_fired_at per (rule, host) pair. A rule will not re-fire for the same host inside the cooldown window even if the condition stays true. Cooldown resets the moment the condition flips back to false.
  • Rate limiting — on top of cooldown, the dispatcher caps email volume at 60 messages per minute per workspace to protect the SMTP relay. Excess events are coalesced into a single digest queued for the next minute.
  • Workspace SMTP fallback — if the workspace has not configured outbound SMTP under Admin → SMTP, alerts fall through to the platform default sender. The rule will still fire; only the From: address differs.
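The cooldown and rate-limit behaviour in the list above can be modelled in a few lines. This is a minimal sketch under stated assumptions, not the dispatcher's code: CooldownTracker and split_for_rate_limit are hypothetical names, state is kept in memory rather than in the database, and the rate limiter only separates what is sent now from what would be coalesced into the digest.

```python
from datetime import datetime, timedelta


class CooldownTracker:
    """Tracks last_fired_at per (rule, host) pair; a hypothetical helper."""

    def __init__(self, cooldown_minutes: int = 1440) -> None:
        self.cooldown = timedelta(minutes=cooldown_minutes)
        self.last_fired_at: dict[tuple[str, str], datetime] = {}

    def should_fire(self, rule_id: str, host_id: str,
                    condition: bool, now: datetime) -> bool:
        key = (rule_id, host_id)
        if not condition:
            # Condition flipped back to false: the cooldown resets.
            self.last_fired_at.pop(key, None)
            return False
        last = self.last_fired_at.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window
        self.last_fired_at[key] = now
        return True


def split_for_rate_limit(events: list[str],
                         limit: int = 60) -> tuple[list[str], list[str]]:
    """Return (send_now, coalesce_into_digest) for one minute of email events."""
    return events[:limit], events[limit:]


tracker = CooldownTracker()
t0 = datetime(2024, 1, 1, 9, 0)
results = [
    tracker.should_fire("rule-1", "host-a", True, t0),                        # first fire
    tracker.should_fire("rule-1", "host-a", True, t0 + timedelta(hours=1)),   # suppressed
    tracker.should_fire("rule-1", "host-b", True, t0 + timedelta(hours=1)),   # other host
    tracker.should_fire("rule-1", "host-a", False, t0 + timedelta(hours=2)),  # resets
    tracker.should_fire("rule-1", "host-a", True, t0 + timedelta(hours=3)),   # fires again
]
send_now, digest = split_for_rate_limit([f"event-{i}" for i in range(75)])
```

Note how host-b fires independently of host-a, and how host-a fires again three hours later because the condition cleared in between.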

Common workflows

1. Create an alert rule for 30/14/1-day expiry

  1. Open Certificates and click the Bell icon in the page header. (Or, for host-uptime rules, Monitoring → Bell.) The Alert Rules page opens pre-filtered to that module's scope, so the rule you create is automatically tagged scope=certificate (or monitoring).
  2. Click + New rule.
  3. Name: Cert expiry — production.
  4. Trigger: expiring_soon. Days threshold: pick a single integer here (you can create three rules for 30/14/1, or rely on per-host notify_days_before and a single threshold of 30).
  5. Alert scope: workspace (or host with a host-picker if you only care about a subset).
  6. Recipients: type email addresses (validated inline) and pick webhook channels from the dropdown.
  7. Cooldown: leave at 1440 minutes (24 hours) so the same host fires at most once a day while it stays inside the same threshold band.
  8. Save. The rule becomes active immediately and is surfaced under the same Bell icon you opened it from.
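For the three-rule variant of step 4, the field values entered in the form would look roughly like the dictionaries below. The field names come from the AlertRule field list; the expiry_rule helper, the rule names, and the recipient address are hypothetical, and nothing here implies a particular API endpoint for creating rules.

```python
def expiry_rule(days: int, recipients: list[str]) -> dict:
    """Build the field values for one expiring_soon rule (illustrative only)."""
    return {
        "name": f"Cert expiry, production ({days}d)",  # hypothetical naming scheme
        "trigger_on": "expiring_soon",
        "days_threshold": days,
        "scope": "certificate",      # owner module, set by the Bell icon used
        "alert_scope": "workspace",  # or "host" together with scope_hosts
        "scope_hosts": [],
        "recipients": recipients,
        "cooldown_minutes": 1440,    # the documented default
        "is_active": True,
    }


# One rule per threshold band: 30, 14 and 1 days before expiry.
rules = [expiry_rule(days, ["ops@example.com"]) for days in (30, 14, 1)]
```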

2. Wire a rule to a Slack channel via webhook

  1. Open Integrations → Webhooks and add an Incoming Webhook with the Slack URL. Test it once using the Send test button; on success it appears as a WebhookDelivery row with status delivered.
  2. Back in Alert rules → + New rule, in the Recipients field, pick the webhook from the dropdown.
  3. When the rule fires, the dispatcher posts a JSON payload (alert_rule_id, trigger, host, snapshot_id, days_until_expiry) to the webhook target. Slack formats it via the workspace incoming-webhook templating.
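A receiver can expect a body shaped roughly like the one built below. The five keys are the ones listed in step 3; the value types, the sample values, and the build_payload helper are assumptions made for illustration.

```python
import json


def build_payload(alert_rule_id: str, trigger: str, host: str,
                  snapshot_id: str, days_until_expiry: int) -> str:
    """Serialize the documented payload fields to a JSON string."""
    return json.dumps(
        {
            "alert_rule_id": alert_rule_id,
            "trigger": trigger,
            "host": host,
            "snapshot_id": snapshot_id,
            "days_until_expiry": days_until_expiry,
        },
        sort_keys=True,
    )


# Sample values are invented; real IDs are whatever the dispatcher emits.
body = build_payload("rule-123", "expiring_soon", "api.example.com", "snap-456", 14)
decoded = json.loads(body)
```

Decoding the body on the receiving side gives back the same five keys, which is usually enough to route the alert by trigger and host.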

3. Send a test alert without waiting for a real event

  1. Open the rule's detail page.
  2. Click Send test (E3).
  3. The dispatcher invokes the rule's channel layer with a synthetic snapshot describing a fake host. Recipients receive a clearly-labelled [TEST] message.
  4. The result panel shows per-channel status — delivered, retrying, failed — so you can debug a misconfigured webhook before a real event hits.

4. Temporarily disable a noisy rule

  1. Open the rule detail.
  2. Toggle Active off, or extend Cooldown to a larger value (e.g. 10080 minutes = 7 days).
  3. Disabled rules remain visible in the list with a muted style.

5. Audit which rule fired and when

  1. From either Certificates → Bell or Monitoring → Bell, switch the Alert Rules page to the Activity tab. The tab respects the current ?scope= filter, so you only see fires from the module you're investigating.
  2. The activity table lists every fire across every rule with fired_at, trigger, host, recipient, and the resulting WebhookDelivery or EmailLog link.
  3. Filter by date, rule or host. Drop the ?scope= query string to see fires across both modules. Export to CSV for incident retrospectives.
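Once exported, the CSV can be summarised with a few lines of standard-library Python. The sketch below assumes the export uses the column names shown in the activity table (fired_at, trigger, host, recipient); the actual header row may differ, and the sample rows are invented.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for an exported activity CSV.
SAMPLE_EXPORT = """fired_at,trigger,host,recipient
2024-03-01T09:00:00Z,expiring_soon,api.example.com,ops@example.com
2024-03-01T09:05:00Z,scan_failed,db.example.com,ops@example.com
2024-03-02T09:00:00Z,expiring_soon,api.example.com,ops@example.com
"""


def fires_per(column: str, export_text: str) -> Counter:
    """Count activity rows grouped by the given column."""
    reader = csv.DictReader(io.StringIO(export_text))
    return Counter(row[column] for row in reader)


by_trigger = fires_per("trigger", SAMPLE_EXPORT)
by_host = fires_per("host", SAMPLE_EXPORT)
```

For a real file, pass the text read from the exported CSV instead of SAMPLE_EXPORT.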

Permissions

Action                      Roles
View rules                  All workspace members
Create or edit rule         Operator, Admin, Owner
Delete rule                 Admin, Owner
Send test                   Operator, Admin, Owner
Configure workspace SMTP    Admin, Owner (link from the rule editor)
View activity log           All authenticated workspace members

Rule endpoints inherit WorkspaceScopedMixin. A rule from workspace A is never visible to workspace B.
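The permission table can be read as a simple lookup from action to allowed roles. The sketch below is only a model of that table: the action keys, the can function, and the "Viewer" role used in the example are hypothetical, and the real checks live in the workspace-scoped endpoints.

```python
ANY_MEMBER = None  # sentinel: every workspace member may perform the action

ALLOWED_ROLES = {
    "view_rules": ANY_MEMBER,
    "create_or_edit_rule": {"Operator", "Admin", "Owner"},
    "delete_rule": {"Admin", "Owner"},
    "send_test": {"Operator", "Admin", "Owner"},
    "configure_workspace_smtp": {"Admin", "Owner"},
    "view_activity_log": ANY_MEMBER,
}


def can(role: str, action: str) -> bool:
    """Return True if the role is allowed to perform the action."""
    allowed = ALLOWED_ROLES[action]
    return allowed is ANY_MEMBER or role in allowed


checks = {
    "viewer_views": can("Viewer", "view_rules"),         # hypothetical role name
    "operator_edits": can("Operator", "create_or_edit_rule"),
    "operator_deletes": can("Operator", "delete_rule"),
    "admin_deletes": can("Admin", "delete_rule"),
}
```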

Troubleshooting

Rule looks correct but never fires. Check three things in order. First, is_active on the rule. Second, alert_scope — a rule scoped to a host must list that host's UUID in scope_hosts. Third, last_fired_at — the cooldown may have suppressed the fire. The Activity table shows suppressed evaluations as cooldown rows.
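The three checks can be expressed as a small diagnostic function, applied in the same order. This is a sketch that treats a rule as a plain dictionary with the documented field names; why_not_firing is a hypothetical helper, and it simplifies cooldown to a single last_fired_at value rather than one per host.

```python
from datetime import datetime, timedelta
from typing import Optional


def why_not_firing(rule: dict, host_id: str, now: datetime) -> Optional[str]:
    """Return the first reason the rule would stay silent, or None if it can fire."""
    if not rule["is_active"]:
        return "rule is inactive"
    if rule["alert_scope"] == "host" and host_id not in rule["scope_hosts"]:
        return "host is not listed in scope_hosts"
    last = rule.get("last_fired_at")
    if last is not None and now - last < timedelta(minutes=rule["cooldown_minutes"]):
        return "suppressed by cooldown"
    return None


now = datetime(2024, 1, 2, 12, 0)
rule = {
    "is_active": True,
    "alert_scope": "host",
    "scope_hosts": ["host-a"],
    "cooldown_minutes": 1440,
    "last_fired_at": datetime(2024, 1, 2, 6, 0),  # six hours before `now`
}
reason_in_scope = why_not_firing(rule, "host-a", now)
reason_out_of_scope = why_not_firing(rule, "host-b", now)
reason_inactive = why_not_firing({**rule, "is_active": False}, "host-a", now)
reason_ok = why_not_firing({**rule, "last_fired_at": None}, "host-a", now)
```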

I can't find a rule I know I created. Check the Bell icon you opened the Alert Rules page from. A rule created via the Certificates Bell carries scope=certificate and is hidden when you open the page via the Monitoring Bell (and vice versa). To list every rule regardless of owner module, navigate directly to /monitoring/alerts without the ?scope= query parameter.

Webhook delivery shows failed for hours. Webhook deliveries retry with exponential backoff up to 5 attempts (1 min, 5 min, 30 min, 2 h, 6 h). After that the delivery is dead-lettered. Open the delivery row to see the response status and body, then fix the receiver and click Retry.
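To see why a delivery can sit in a retrying or failed state for hours, it helps to add up the backoff intervals. The sketch below assumes each delay is measured from the previous failed attempt; the text above does not state whether that is the case, so treat the timestamps as an estimate.

```python
from datetime import datetime, timedelta

# The documented backoff intervals, in order.
RETRY_DELAYS = [
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=6),
]


def retry_schedule(first_failure: datetime) -> list[datetime]:
    """Return when each retry would run, assuming delays chain one after another."""
    times = []
    current = first_failure
    for delay in RETRY_DELAYS:
        current += delay
        times.append(current)
    return times


schedule = retry_schedule(datetime(2024, 1, 1, 0, 0))
total_wait = schedule[-1] - datetime(2024, 1, 1, 0, 0)
```

Under that assumption the last retry runs 8 hours 36 minutes after the first failure, so a receiver that stays broken will only be dead-lettered after most of a working day.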

Emails arrive from noreply@vigilo.dev instead of our own domain. The workspace SMTP profile has not been saved. Visit Admin → SMTP and configure the host, port, credentials, and From: address. After saving, future fires use that profile.

An expiring-soon rule fires once at 30 days then nothing at 14 or 7. Either you only configured one days_threshold, or cooldown is too long. The cleanest model is one rule per threshold band; alternatively, set notify_days_before on each host to [30, 14, 7, 1] and have the rule fire on transitions through any of those points.
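Firing "on transitions" means comparing the previous and current days_until_expiry and reporting each notification point that was crossed between them. The function below is a sketch of that idea; crossed_thresholds is a hypothetical name and the dispatcher's actual comparison may differ.

```python
def crossed_thresholds(prev_days: int, curr_days: int,
                       notify_days_before: list[int]) -> list[int]:
    """Return the notification points crossed between two consecutive snapshots."""
    return [
        point
        for point in sorted(notify_days_before, reverse=True)
        if curr_days <= point < prev_days
    ]


points = [30, 14, 7, 1]
at_thirty = crossed_thresholds(31, 30, points)    # reaches the 30-day point
between = crossed_thresholds(30, 29, points)      # nothing crossed
skipped_scan = crossed_thresholds(15, 13, points)  # jumps over the 14-day point
long_gap = crossed_thresholds(40, 5, points)       # several points crossed at once
```

A host that goes from 30 to 29 days crosses nothing, which is why a single threshold combined with a long cooldown produces exactly one alert.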

scan_failed fires immediately on a brand new host. A single failed probe does not fire the rule — two consecutive failures must occur. If you see otherwise, check whether the host already had failure history before you re-added it.
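The two-consecutive-failures rule, and the way old history can make a re-added host appear to fire on its first probe, can be illustrated as follows. The function name and the list-of-booleans history format are assumptions made for this sketch.

```python
def scan_failed_should_fire(history: list[bool], required: int = 2) -> bool:
    """history holds probe outcomes oldest-first; True means the probe succeeded."""
    if len(history) < required:
        return False
    # Fire only when the most recent `required` probes all failed.
    return not any(history[-required:])


new_host_one_failure = scan_failed_should_fire([False])
failure_after_success = scan_failed_should_fire([True, False])
two_in_a_row = scan_failed_should_fire([False, False])
recovered = scan_failed_should_fire([False, False, True])

# A re-added host that kept one old failure fires on its first new failure.
old_history = [True, False]
readded_first_probe = scan_failed_should_fire(old_history + [False])
```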

Related articles

  • Host monitoring — the data source for every trigger except SLO burn alerts.
  • Service Level Objectives — for availability and latency alerts, not certificate alerts.
  • Status page — auto-publish a public note when a rule fires above a severity threshold.