Playbooks (workflow automation)

Overview

A Playbook is Vigilo's workflow automation: when an event happens (or on a schedule, or on demand), evaluate a tree of conditions, then execute an ordered set of actions — create tasks, open changes, notify channels, fire webhooks. Playbooks live at /ws/{slug}/settings/playbooks and execute on the Celery worker pool. Each execution writes a PlaybookRun row so the entire history is auditable.

Where Webhooks push a raw event to one URL, a Playbook is a workspace-internal multi-step recipe that can chain actions, branch on conditions, and feed step outputs into later step inputs.

Why it exists

Operations is repetition. The same five steps happen every time a sev-2 incident opens: post to Slack, page on-call, create a follow-up task, open a Jira ticket, schedule a post-mortem doc. Relying on humans to remember those five steps means one or two get dropped under stress. The Playbook system codifies the recipe once, in the workspace, and re-applies it identically every time the trigger fires, with a full audit log of what ran and what it returned.

Key concepts

Playbook model

Playbook (workspace, name, trigger (string event key or cron), cron_expr (nullable), conditions (json), steps (json), is_active, created_by, created_at).

  • trigger — an event key (incident.opened, change.submitted, member.invited, etc.) or the literal cron for scheduled playbooks.
  • conditions — JSON tree in the conditions DSL (see below) evaluated against the event payload.
  • steps — ordered JSON array of step objects: {id, action, inputs, on_failure}.

PlaybookRun

PlaybookRun (playbook, trigger_data JSON, status (running | success | partial | failed), started_at, completed_at, step_results JSON (per-step result map)). The detail view renders each step with its status, inputs, outputs, and duration — the per-action timeline.

PlaybookEvaluator + PlaybookExecutor (T1.4)

Two services split the work:

  • PlaybookEvaluator — called on every dispatched event, walks active playbooks, evaluates their conditions against the payload, and enqueues a PlaybookRun for matches.
  • PlaybookExecutor — Celery task that consumes a PlaybookRun, walks the steps in order, calls each action with its rendered inputs, collects results, and finalises the run status.

Action types

The action catalog (T1.4):

  • create_task — create a Task row.
  • create_change — create a ChangeRequest draft.
  • add_comment — add a Comment to a referenced entity.
  • notify — emit a notification (in-app + Slack/Teams/email per the user's prefs).
  • webhook — call a registered Webhook by name.
  • run_runbook — start a RunbookExecution.
  • wait — pause for a duration or until a condition.
  • manual_check — pause until a named user clicks Continue (used in human-in-the-loop flows).

DAG playbooks (WB.1)

Beyond linear steps, playbooks support DAG semantics: a step can declare depends_on: [stepA, stepB] and the executor will wait for both before running. Combined with on_failure: continue | abort | branch:<step_id> you get conditional branching, parallel fan-out, and joins.
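One way to picture how depends_on produces fan-out and joins is to group steps into "waves" of mutually independent work. The sketch below uses Python's standard graphlib module for this; it is an illustration of the ordering idea only, not the executor's actual scheduling code, and the function name execution_waves and the step ids are made up for the example.

```python
from graphlib import TopologicalSorter


def execution_waves(steps):
    """Group step ids into waves; every step in a wave has all of its
    depends_on entries satisfied by earlier waves (hypothetical helper)."""
    # Map each step id to the set of step ids it must wait for.
    graph = {s["id"]: set(s.get("depends_on", [])) for s in steps}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the steps form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Illustrative steps: two independent notifications, then a join.
steps = [
    {"id": "notify_slack", "action": "notify"},
    {"id": "page_oncall", "action": "webhook"},
    {"id": "create_followup", "action": "create_task",
     "depends_on": ["notify_slack", "page_oncall"]},
]

print(execution_waves(steps))
# [['notify_slack', 'page_oncall'], ['create_followup']]
```

Steps that share a wave could run in parallel; a step with depends_on only becomes ready once every listed step is done.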

Each step also supports retries (max retries with backoff) and timeout_seconds (wall-clock limit per attempt). On timeout, the step is failed and on_failure is honoured.
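The retry behaviour can be sketched as a small wrapper around a step's action. The text above does not say which backoff curve is used, so the exponential doubling here is an assumption, as are the names run_with_retries, max_retries, and base_delay. The per-attempt timeout is left out to keep the sketch short.

```python
import time


def run_with_retries(action, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call action(); on failure retry up to max_retries times.

    Assumption: exponential backoff (base_delay, 2x, 4x, ...). The real
    backoff policy is not specified in this document.
    """
    for attempt in range(max_retries + 1):
        try:
            return action()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted; the caller applies on_failure
            sleep(base_delay * 2 ** attempt)


# Demo: an action that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays = []  # record the delays instead of really sleeping
result = run_with_retries(flaky, max_retries=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

Injecting the sleep function keeps the demo fast and makes the backoff sequence easy to inspect.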

Input mapping

Step inputs are rendered through a template DSL (WB.3). The syntax ${steps.<id>.output.<field>} references an output from an earlier step. Example:

{
  "id": "create_followup",
  "action": "create_task",
  "inputs": {
    "title": "Follow up on ${trigger.incident.number}",
    "linked_change": "${steps.create_change.output.id}"
  }
}

Available namespaces in the template: ${trigger.*} (the event payload), ${steps.<id>.output.*} (any earlier step's result), ${env.*} (workspace-level non-secret variables).
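To make the namespace lookup concrete, here is a minimal rendering sketch. It assumes dotted paths made of letters, digits, and underscores, and it assumes that an input consisting of a single placeholder keeps the referenced value's native type; neither detail is confirmed by this document. The names render and TemplateUnresolved are hypothetical.

```python
import re

# Assumption: placeholder paths contain only letters, digits, "_" and ".".
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


class TemplateUnresolved(KeyError):
    """Raised when a placeholder path is missing from the context."""


def _lookup(path, context):
    node = context
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            raise TemplateUnresolved(path)
        node = node[part]
    return node


def render(value, context):
    """Recursively substitute ${...} placeholders in a step's inputs."""
    if isinstance(value, dict):
        return {k: render(v, context) for k, v in value.items()}
    if isinstance(value, list):
        return [render(v, context) for v in value]
    if not isinstance(value, str):
        return value
    whole = _PLACEHOLDER.fullmatch(value)
    if whole:
        # Assumption: a lone placeholder keeps its native type (e.g. an int id).
        return _lookup(whole.group(1), context)
    return _PLACEHOLDER.sub(lambda m: str(_lookup(m.group(1), context)), value)


# Illustrative context; the values are made up.
context = {
    "trigger": {"incident": {"number": "INC-42"}},
    "steps": {"create_change": {"output": {"id": 17}}},
    "env": {},
}
inputs = {
    "title": "Follow up on ${trigger.incident.number}",
    "linked_change": "${steps.create_change.output.id}",
}
rendered = render(inputs, context)
print(rendered)
# {'title': 'Follow up on INC-42', 'linked_change': 17}
```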

Conditions language

conditions is a JSON tree built from primitive comparators and boolean combinators:

{"and": [
  {"eq": ["${trigger.severity}", "sev2"]},
  {"in": ["${trigger.service}", ["auth", "payments"]]},
  {"not": {"eq": ["${trigger.actor.role}", "bot"]}}
]}

Supported primitives: eq, neq, gt, gte, lt, lte, in, contains, matches (regex), exists. Combinators: and, or, not. Evaluated short-circuit, left-to-right.
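A small interpreter for this tree might look like the sketch below. It rests on several assumptions not stated here: exists takes a single operand, a comparison against a missing field is false, and only ${trigger.*} references appear in conditions. The function names are hypothetical.

```python
import re

_REF = re.compile(r"\$\{trigger\.([A-Za-z0-9_.]+)\}")
_MISSING = object()  # sentinel for "path not present in the payload"


def resolve(operand, payload):
    """Turn a '${trigger.a.b}' operand into payload['a']['b']; pass literals through."""
    if isinstance(operand, str):
        m = _REF.fullmatch(operand)
        if m:
            node = payload
            for part in m.group(1).split("."):
                if not isinstance(node, dict) or part not in node:
                    return _MISSING
                node = node[part]
            return node
    return operand


COMPARATORS = {
    "eq": lambda a, b: a == b,
    "neq": lambda a, b: a != b,
    "gt": lambda a, b: a > b,
    "gte": lambda a, b: a >= b,
    "lt": lambda a, b: a < b,
    "lte": lambda a, b: a <= b,
    "in": lambda a, b: a in b,
    "contains": lambda a, b: b in a,
    "matches": lambda a, b: re.search(b, a) is not None,
}


def evaluate(node, payload):
    """Evaluate a single-key condition node against the event payload."""
    (op, arg), = node.items()
    if op == "and":
        return all(evaluate(child, payload) for child in arg)  # short-circuits
    if op == "or":
        return any(evaluate(child, payload) for child in arg)  # short-circuits
    if op == "not":
        return not evaluate(arg, payload)
    if op == "exists":  # assumption: a single operand
        return resolve(arg, payload) is not _MISSING
    left, right = (resolve(x, payload) for x in arg)
    if left is _MISSING or right is _MISSING:
        return False  # assumption: missing fields never match
    return COMPARATORS[op](left, right)


conditions = {"and": [
    {"eq": ["${trigger.severity}", "sev2"]},
    {"in": ["${trigger.service}", ["auth", "payments"]]},
    {"not": {"eq": ["${trigger.actor.role}", "bot"]}},
]}
payload = {"severity": "sev2", "service": "auth", "actor": {"role": "engineer"}}
print(evaluate(conditions, payload))  # True
```

Because all() and any() consume a generator lazily, evaluation stops at the first deciding child, which matches the short-circuit, left-to-right behaviour described above.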

Scheduled playbooks (WB.4)

Setting trigger='cron' with a cron_expr registers the playbook with the Celery beat scheduler. Scheduled playbooks have no trigger event payload; the conditions evaluate against a synthetic payload {schedule: {now: <ISO>, cron: <expr>}}.

Useful for recurring jobs: weekly cert audit, daily DORA snapshot post to Slack, monthly stale-task cleanup.
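For reference, the synthetic payload for a scheduled run could be built roughly as follows. This is a sketch of the shape described above; the helper name synthetic_schedule_payload is hypothetical, and the exact ISO formatting Vigilo emits is not specified here.

```python
from datetime import datetime, timezone


def synthetic_schedule_payload(cron_expr, now=None):
    """Build the {schedule: {now, cron}} payload for a scheduled run (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return {"schedule": {"now": now.isoformat(), "cron": cron_expr}}


# Example: a Monday 08:00 UTC firing of the weekly expression "0 8 * * 1".
fired_at = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
example = synthetic_schedule_payload("0 8 * * 1", now=fired_at)
print(example)
# {'schedule': {'now': '2024-01-01T08:00:00+00:00', 'cron': '0 8 * * 1'}}
```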

Secrets vault — deferred (WB.5)

A first-class Secrets vault for storing per-action credentials (API keys, database passwords used by playbook steps) is on the roadmap (WB.5). Today, step credentials are either looked up from existing IntegrationConfig rows (which use the CMEK descriptor) or passed as workspace-level non-secret variables. Do not put plaintext secrets in playbook step inputs; they will be visible in PlaybookRun.trigger_data and in the audit log.

Common workflows

Create a simple playbook

  1. Settings → Playbooks → New playbook, name it ("Sev-2 incident response").
  2. Pick trigger incident.opened.
  3. Conditions: {"eq": ["${trigger.severity}", "sev2"]}.
  4. Add steps in the visual editor: notify Slack channel, create_task for the on-call, webhook to the PagerDuty integration.
  5. Activate.

Inspect a run

  1. Playbooks → {name} → Runs, click a row.
  2. The detail page shows each step with timing, inputs (post-template), and outputs.
  3. Failed steps show the error class and any stderr from the executor.

Use input mapping

A two-step playbook to open an incident and immediately create a follow-up task referencing it:

[
  {
    "id": "open_inc",
    "action": "open_incident",
    "inputs": {"title": "Cron-detected anomaly", "severity": "sev3"}
  },
  {
    "id": "create_followup",
    "action": "create_task",
    "depends_on": ["open_inc"],
    "inputs": {
      "title": "Investigate ${steps.open_inc.output.number}",
      "linked_incident": "${steps.open_inc.output.id}"
    }
  }
]
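Before activating a playbook like this, a pre-flight check can catch step references that lack a matching depends_on entry. The sketch below is deliberately stricter than the executor may be, since linear playbooks can reference earlier steps without declaring a dependency. The helper names are hypothetical.

```python
import re

# Matches the step id inside ${steps.<id>.output....}
_STEP_REF = re.compile(r"\$\{steps\.([A-Za-z0-9_]+)\.output\.")


def referenced_steps(value):
    """Collect every step id referenced anywhere inside an inputs structure."""
    if isinstance(value, str):
        return set(_STEP_REF.findall(value))
    if isinstance(value, dict):
        return set().union(*(referenced_steps(v) for v in value.values()))
    if isinstance(value, list):
        return set().union(*(referenced_steps(v) for v in value))
    return set()


def missing_dependencies(steps):
    """Map step id -> referenced step ids that are not listed in depends_on."""
    problems = {}
    for step in steps:
        declared = set(step.get("depends_on", []))
        missing = referenced_steps(step.get("inputs", {})) - declared
        if missing:
            problems[step["id"]] = sorted(missing)
    return problems


playbook_steps = [
    {"id": "open_inc", "action": "open_incident",
     "inputs": {"title": "Cron-detected anomaly", "severity": "sev3"}},
    {"id": "create_followup", "action": "create_task",
     "depends_on": ["open_inc"],
     "inputs": {"title": "Investigate ${steps.open_inc.output.number}",
                "linked_incident": "${steps.open_inc.output.id}"}},
]
print(missing_dependencies(playbook_steps))  # {}
```

An empty result means every referenced step is declared as a dependency; dropping depends_on from create_followup would report open_inc as missing.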

Schedule a weekly cert audit

  1. New playbook, trigger cron, cron_expr 0 8 * * 1 (Mondays at 08:00 UTC).
  2. Step: webhook to your internal cert-audit endpoint, then notify the security channel with the response summary.
  3. Activate.

Permissions

  • Owners and admins can create, edit, activate, and run-now playbooks.
  • Engineers can run-now an existing playbook from the UI but cannot edit.
  • Approvers / viewers can view the run history but cannot trigger.

Troubleshooting

Playbook never fires on event — Check Runs: a run is logged with status='skipped' when conditions evaluated false. Inspect the trigger payload in the run detail.

Step fails with template_unresolved — Input template references a step output that doesn't exist or hasn't run yet (missing depends_on). Add the dependency.

manual_check step stuck — The named user has not clicked Continue. Reassign from the run detail or override (admin only) to advance.

Cron playbook runs in the wrong timezone — Cron expressions are interpreted in UTC by default. To interpret them in your local zone, set Workspace.settings.playbook_timezone to that timezone.

Step input shows my password in the run log — You put a plaintext secret in inputs. Rotate the secret immediately and use an IntegrationConfig reference instead. The Secrets vault (WB.5) is on the roadmap; until then, this is the only safe pattern.

DAG playbook runs sequentially instead of in parallel — depends_on is missing or every step depends on its immediate predecessor. The executor parallelises independent branches automatically.

Related