Reliability

Certificate renewal automation

Certificate renewal automation closes the loop between detection and remediation. Instead of emailing an operator "this cert expires in 14 days", Vigilo can re-key, re-issue, deploy, and verify the new certificate without anyone touching a CLI.

Overview

A CertificateRenewalPolicy describes how a single hostname should be renewed: which CA provider to use (Let's Encrypt, AWS ACM, GCP CM, manual), which key backend holds the private key, which validation method to use, the cooldown between runs, and any provider-specific configuration. Each policy can be run on demand or in bulk, in dry-run or live mode.

Each run becomes a CertificateRenewalRun row with its own state machine (idle → queued → running → succeeded / failed / needs_manual), a structured log, and — on success — a fresh CertificateSnapshot to confirm the new chain is being served.
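
The run lifecycle above can be sketched as a small transition table. This is an illustrative model only, not Vigilo's implementation: the class and function names are hypothetical, and the needs_manual → running edge is an assumption inferred from the Resume workflow described later in this article.

```python
from enum import Enum


class RunState(str, Enum):
    IDLE = "idle"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    NEEDS_MANUAL = "needs_manual"


# Allowed transitions, read off the state list in the paragraph above.
# Assumption: a needs_manual run can go back to running when resumed.
ALLOWED = {
    RunState.IDLE: {RunState.QUEUED},
    RunState.QUEUED: {RunState.RUNNING},
    RunState.RUNNING: {RunState.SUCCEEDED, RunState.FAILED, RunState.NEEDS_MANUAL},
    RunState.NEEDS_MANUAL: {RunState.RUNNING},
    RunState.SUCCEEDED: set(),
    RunState.FAILED: set(),
}


def transition(current: RunState, target: RunState) -> RunState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table explicit means an out-of-order update (for example idle straight to succeeded) fails loudly instead of silently corrupting the run history.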

Why it exists

Manual renewal procedures rot. Engineers leave, runbooks fall out of date, the team that owns the cert is not the team that owns the load balancer. Automating the whole loop and giving it a visible Kanban board means the renewal status of every cert is obvious at a glance — and a failed renewal opens an incident instead of disappearing into someone's inbox.

Key concepts

  • CertificateRenewalPolicy fields: hostname, provider (letsencrypt, aws_acm, gcp_cm, manual), challenge (http-01, dns-01, none), key_backend (local, vault, aws_kms), key_backend_config (JSON), cooldown_seconds, dry_run_default, notify_on_failure (list of email addresses).
  • Providers — Let's Encrypt is fully implemented via the ACME adapter. AWS ACM uses the public ACM API to issue and import certificates. GCP CM is a stub at the moment — the wiring is in place but issuance is gated until WD.20 ships. manual lets you stage a CSR and upload the resulting certificate by hand while still benefiting from the Kanban tracking.
  • Key backends (WC.6): local keeps the private key encrypted on disk; vault uses HashiCorp Vault's PKI/Transit engines; aws_kms keeps the key inside KMS and signs CSRs there. The key_backend_config JSON carries the credentials reference (a secret name, not the secret value).
  • Dry-run mode (WB.26) — simulates the ACME flow end-to-end without actually requesting a certificate. Useful for validating DNS records, IAM permissions or webroot paths before going live. The run completes in seconds and produces a structured report of what would have happened.
  • Cooldown — minimum interval between consecutive runs for the same hostname. Prevents thundering-herd behaviour against a CA's rate-limit endpoints.
  • Cost-aware bulk renewal (WD.12) — when you bulk-renew, the orchestrator picks the cheapest viable provider per hostname based on policy + workspace defaults. ACM has a per-issuance cost; Let's Encrypt is free; private CAs may have throughput cost. The Kanban shows the chosen provider per card.
  • Failure escalation (T1.7) — a failed or needs_manual run automatically opens an Incident linked to the host, with severity derived from the host's criticality. Operators see the new incident in their queue without anyone forwarding the email.
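
To make the policy fields and the cooldown rule concrete, here is a minimal sketch. The field names and allowed values come from the list above; the default values, the validation logic, and the cooldown_remaining helper are assumptions for illustration and may not match how Vigilo stores or enforces them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

PROVIDERS = {"letsencrypt", "aws_acm", "gcp_cm", "manual"}
CHALLENGES = {"http-01", "dns-01", "none"}
KEY_BACKENDS = {"local", "vault", "aws_kms"}


@dataclass
class CertificateRenewalPolicy:
    hostname: str
    provider: str = "letsencrypt"
    challenge: str = "http-01"
    key_backend: str = "local"
    # Holds a reference to the credentials (a secret name), never the secret itself.
    key_backend_config: dict = field(default_factory=dict)
    cooldown_seconds: int = 3600  # illustrative default, not a documented value
    dry_run_default: bool = True  # illustrative default, not a documented value
    notify_on_failure: list = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.provider not in PROVIDERS:
            raise ValueError(f"unknown provider: {self.provider}")
        if self.challenge not in CHALLENGES:
            raise ValueError(f"unknown challenge: {self.challenge}")
        if self.key_backend not in KEY_BACKENDS:
            raise ValueError(f"unknown key backend: {self.key_backend}")


def cooldown_remaining(
    policy: CertificateRenewalPolicy,
    last_run_at: Optional[datetime],
    now: datetime,
) -> timedelta:
    """Time left before the hostname may be run again (zero if it is ready)."""
    if last_run_at is None:
        return timedelta(0)
    ready_at = last_run_at + timedelta(seconds=policy.cooldown_seconds)
    return max(ready_at - now, timedelta(0))


policy = CertificateRenewalPolicy(hostname="api.example.com", cooldown_seconds=600)
last_run = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(cooldown_remaining(policy, last_run, last_run + timedelta(seconds=100)))
```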

Common workflows

1. Create a Let's Encrypt policy for a public host

  1. Open Reliability → Cert renewal → + New policy.
  2. Hostname: api.example.com.
  3. Provider: letsencrypt. Challenge: http-01 (or dns-01 if your DNS provider is supported).
  4. Key backend: local for the simple case; vault or aws_kms if your security policy requires it. Fill in the relevant config (Vault path, KMS key ARN, etc.).
  5. Notify on failure: type one or more email addresses.
  6. Save. The policy row appears with status idle.

2. Run a renewal in dry-run mode first

  1. Open the policy detail.
  2. Click Run → Dry run.
  3. The run executes the full ACME flow up to but not including the final issuance call. It checks DNS records, webroot reachability, IAM permissions and webhook reachability.
  4. The result page shows each step with a green tick or red X, plus the raw ACME log. Fix any issues and re-run dry-run until it is fully green.
  5. Click Run → Live to issue for real.
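
The dry-run report described in this workflow is essentially a list of named checks, each ending in a tick or an X. The sketch below shows that shape under stated assumptions: StepResult, run_dry_run and ready_for_live are hypothetical names, and the individual checks are passed in as plain callables rather than being Vigilo's real probes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    detail: str


def run_dry_run(checks: List[Tuple[str, Callable[[], str]]]) -> List[StepResult]:
    """Run every check even if an earlier one fails, so the report is complete."""
    results = []
    for name, check in checks:
        try:
            results.append(StepResult(name, True, check()))
        except Exception as exc:  # a failing check becomes a red X, not a crash
            results.append(StepResult(name, False, str(exc)))
    return results


def ready_for_live(results: List[StepResult]) -> bool:
    """Only a fully green report should be followed by a live run."""
    return all(result.ok for result in results)
```

Running all checks instead of stopping at the first failure matches the advice in step 4: you see every problem in one pass and fix them together.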

3. Bulk-renew an environment overnight

  1. From the Cert renewal main view, click the Kanban tab.
  2. Filter to the scope (e.g. environment = prod, days_until_expiry <= 21).
  3. Click Select all in scope, then Bulk run.
  4. Choose Cost-aware to let Vigilo pick the cheapest provider per host, or Use policy default to stick with each policy's configured provider.
  5. The Kanban populates with one card per host. Cards move through Queued → Running → Succeeded / Failed. Failures auto-open an incident.
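
One way to picture the Cost-aware option is as a ranking over the providers that are viable for a host. This article does not specify how cost and workspace preference are combined, so the sketch below makes an explicit assumption: lowest cost wins and the workspace preference order only breaks ties. The function name and the cost figures are hypothetical and are not real provider prices.

```python
from typing import Dict, List


def pick_provider(
    viable: List[str],
    cost: Dict[str, float],
    preferred_order: List[str],
) -> str:
    """Pick the cheapest viable provider; preference order breaks ties."""
    if not viable:
        raise ValueError("no viable provider for this host")

    def rank(provider: str):
        # Providers missing from the preference list sort after listed ones.
        if provider in preferred_order:
            preference = preferred_order.index(provider)
        else:
            preference = len(preferred_order)
        return (cost[provider], preference)

    return min(viable, key=rank)


# Illustrative costs only.
costs = {"letsencrypt": 0.0, "aws_acm": 1.0}
print(pick_provider(["aws_acm", "letsencrypt"], costs, ["aws_acm", "letsencrypt"]))
```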

4. Recover from a needs_manual run

  1. The Kanban card turns purple. Click to open the run detail.
  2. The log explains why automation stopped — common reasons: ACME challenge could not be removed, KMS signature mismatch, target load balancer refused the upload.
  3. Address the issue (e.g. manually clean up the validation record, fix the IAM policy).
  4. Click Resume. The run picks up at the failed step instead of restarting from scratch.
  5. If resume is not possible, click Mark resolved manually and upload the new certificate. The next probe will confirm the chain.
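
Resuming at the failed step instead of restarting can be modelled as an ordered list of steps plus a start index. This is a sketch under assumptions: the step names are hypothetical (the real pipeline's steps are not listed in this article) and run_from is an illustrative helper, not a Vigilo API.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical step names, for illustration only.
STEPS: List[str] = ["create_csr", "validate_challenge", "issue", "deploy", "verify"]


def run_from(
    steps: List[str],
    handlers: Dict[str, Callable[[], None]],
    start_at: int = 0,
) -> Optional[int]:
    """Run steps in order from start_at.

    Returns the index of the step that failed, or None if every step succeeded.
    """
    for index in range(start_at, len(steps)):
        try:
            handlers[steps[index]]()
        except Exception:
            return index
    return None
```

If the first attempt returns a failed index, a resume is simply run_from(STEPS, handlers, start_at=failed_index) once the underlying problem is fixed, so steps that already succeeded are not repeated.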

5. Audit renewal history for an auditor

  1. Open Cert renewal → History.
  2. Filter by provider, status, hostname or operator.
  3. Export to CSV. The CSV includes run id, hostname, provider, key backend, started_at, ended_at, status, operator, run mode (live / dry-run), cost estimate.
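
If you post-process the export, it helps to know the column set. The sketch below writes rows with the fields listed in step 3 using Python's standard csv module. The exact header spellings (for example run_id and run_mode) are assumptions, so check them against a real export before relying on them.

```python
import csv
import io
from typing import Dict, Iterable

# Column names are assumed spellings of the fields listed in step 3.
COLUMNS = [
    "run_id", "hostname", "provider", "key_backend", "started_at",
    "ended_at", "status", "operator", "run_mode", "cost_estimate",
]


def export_runs(rows: Iterable[Dict[str, object]]) -> str:
    """Serialise run records to CSV text with a fixed header order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```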

Permissions

Action | Roles
View policies + runs | All workspace members
Create or edit policy | Operator, Admin, Owner
Delete policy | Admin, Owner
Run dry-run | Operator, Admin, Owner
Run live | Operator, Admin, Owner
Resume needs_manual | Operator, Admin, Owner
Configure key backend (KMS/Vault credentials) | Admin, Owner

Policy and run endpoints inherit WorkspaceScopedMixin. Key backend credentials are stored encrypted with workspace-specific CMEK (see Architecture → Encryption) so a compromised workspace cannot decrypt another's keys.
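
The role matrix above can be expressed as a lookup table. This is an illustrative sketch, not the actual WorkspaceScopedMixin logic: the action keys and the "member" role name (standing in for "All workspace members") are hypothetical.

```python
ALL_MEMBERS = frozenset({"member", "operator", "admin", "owner"})
OPERATORS_UP = frozenset({"operator", "admin", "owner"})
ADMINS_UP = frozenset({"admin", "owner"})

# Action keys are hypothetical labels for the rows of the permissions table.
PERMISSIONS = {
    "view": ALL_MEMBERS,
    "edit_policy": OPERATORS_UP,
    "delete_policy": ADMINS_UP,
    "run_dry": OPERATORS_UP,
    "run_live": OPERATORS_UP,
    "resume_needs_manual": OPERATORS_UP,
    "configure_key_backend": ADMINS_UP,
}


def can(role: str, action: str) -> bool:
    """Deny by default: unknown actions or roles are never allowed."""
    return role.lower() in PERMISSIONS.get(action, frozenset())


print(can("Operator", "run_live"))               # True
print(can("Operator", "configure_key_backend"))  # False
```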

Troubleshooting

Live run fails immediately with "rate limited". Let's Encrypt enforces strict per-domain and per-account limits. Wait the time the error indicates, or use the staging endpoint via key_backend_config.endpoint = "staging" while debugging.

Dry-run succeeds but live run fails at the deployment step. The deployment step is the only one dry-run cannot fully validate. Check the deployment hook in the run log — typically an SSH key or load-balancer API token has expired.

KMS signing returns AccessDeniedException. The IAM role used by the FastAPI scanner needs kms:Sign and kms:GetPublicKey on the configured key. Update the IAM policy attached to that role and re-run.
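
As a starting point, the snippet below generates a minimal IAM policy document granting the two actions named above. The key ARN is a placeholder and the Sid is an arbitrary label; how you attach the policy to the scanner's role depends on your AWS setup and is not covered here.

```python
import json


def kms_signing_policy(key_arn: str) -> str:
    """Build an IAM policy document allowing CSR signing with one KMS key."""
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCsrSigning",  # arbitrary label
                "Effect": "Allow",
                "Action": ["kms:Sign", "kms:GetPublicKey"],
                "Resource": key_arn,
            }
        ],
    }
    return json.dumps(document, indent=2)


# Placeholder ARN; substitute the key configured in key_backend_config.
print(kms_signing_policy("arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"))
```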

Cost-aware bulk renew chose ACM when I wanted Let's Encrypt. Cost-aware uses provider cost plus your workspace's preferred_provider order. Bump Let's Encrypt above ACM in Settings → Cert renewal → Provider preference, or run with Use policy default instead.

Renewal succeeded but cert page still shows the old fingerprint. The probe schedule has not caught up. Click Check now on the host in the Hosts page; the snapshot table updates within a few seconds.
