How Vigilo works

Overview

Vigilo is a five-process application: a Django REST API, a FastAPI scanner, a Celery worker, a Celery beat scheduler, and a React SPA. They share a Postgres cluster (two databases, vigilo_django and vigilo_fastapi), a Redis instance (queue, cache, and pub-sub), and an OIDC identity provider (Keycloak by default). In production all five run behind Nginx, on a single VM for small deployments or split across hosts for larger ones.

Every workspace is row-level isolated. The "multi-tenant" in "multi-tenant SaaS" is enforced at the database as well as in the application: even a buggy ViewSet that forgets to scope by workspace can't leak data across boundaries, because Postgres won't return the rows.

Why it exists

Most "monitoring + ITSM + compliance" stacks are three different vendors stitched together with brittle webhooks. Each one has its own identity layer, its own audit log, and its own RBAC, and all three want to be your "single pane of glass." Vigilo's whole architectural premise is one codebase, one identity, one audit log, which is achievable precisely because we're a Django monolith with a tightly coupled FastAPI sidecar, not a microservices mesh.

The split between Django and FastAPI exists for one reason: the certificate scanner does a lot of small I/O-bound network calls, and FastAPI's async story is significantly better than Django's. Putting it in its own process keeps a thousand parallel TLS handshakes from clogging the request/response cycle of the ITSM UI.
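To make the async argument concrete, here is a minimal sketch of bounded-concurrency scanning with asyncio. It is not Vigilo's scanner code: scan_all, fake_handshake, and max_parallel are hypothetical names, and the handshake is simulated with a short sleep rather than a real TLS connection.

```python
import asyncio


async def fake_handshake(host):
    """Stand-in for a TLS handshake; a real scanner would open a socket here."""
    await asyncio.sleep(0.01)
    return {"host": host, "ok": True}


async def scan_all(hosts, max_parallel=100, handshake=fake_handshake):
    """Scan every host concurrently, with at most max_parallel in flight."""
    gate = asyncio.Semaphore(max_parallel)

    async def scan_one(host):
        # The semaphore caps how many handshakes run at the same time.
        async with gate:
            return await handshake(host)

    # gather() returns results in the same order as the input hosts.
    return await asyncio.gather(*(scan_one(h) for h in hosts))


results = asyncio.run(scan_all([f"host-{i}.example" for i in range(200)]))
print(len(results), "hosts scanned")
```

A single event loop waits on all the handshakes at once, so the cost of a slow host is one suspended coroutine rather than one blocked worker thread.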

Key concepts

Django (port 8000 / 9101 dev) — Owns 95% of the data model and almost all writes. Serves the JSON API at /ws/<slug>/api/v1/.... Authenticates via OIDC tokens validated against Keycloak. Every model carries workspace_id; every ViewSet extends WorkspaceScopedMixin.
FastAPI (port 8001 / 9102 dev) — The certificate scanner + monitoring scrapers. Owns the vigilo_fastapi database (snapshots, scan results, latency timeseries). Talks to the Django DB only to read the list of MonitoredHost rows it should be scanning. Exposes a small read API at /ws/<slug>/monitor/... for the SPA's certificate detail panels.
Celery worker — Background jobs. Cert renewal kicks, webhook dispatch, playbook execution, SCIM provisioning, periodic compliance evaluations. Every task takes workspace_id as its first argument so the audit row is correctly scoped.
Celery beat — Scheduler. Triggers cert scans on their cadence, sweeps expiring tokens, reconciles asset inventory, fires "stale change" reminders.
React SPA (port 9100 dev) — Vite-built single-page app served by Nginx in prod. All routes under /ws/<slug>/... are SPA routes; only /ws/<slug>/api/... and /ws/<slug>/monitor/... hit a backend.
Postgres — Two logical DBs in one cluster. Row-level security policies on every workspace-scoped table; the application sets the vigilo.workspace_id session variable on every connection, and the policies read it with current_setting('vigilo.workspace_id').
Redis — Three uses: Celery queue, app-level cache (per-workspace TTLs), and the WebSocket pub-sub channel that lets Django push live events to the React UI.
Keycloak — Default OIDC provider. Any OIDC-compliant IdP works (Okta, Auth0, Azure AD). Vigilo doesn't store passwords.
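The Postgres entry above describes a policy per workspace-scoped table. As a rough illustration, the helper below builds the kind of statements such a migration might run. The real contents of workspaces/migrations/0002_enable_rls.py are not shown in this document, so the policy name, the ::text cast, and the rls_policy_sql helper itself are assumptions; only ENABLE/FORCE ROW LEVEL SECURITY, CREATE POLICY, and current_setting are standard Postgres.

```python
def rls_policy_sql(table, setting="vigilo.workspace_id"):
    """Return statements that put one table under workspace-scoped RLS.

    Hypothetical helper: the policy name and column cast are assumptions.
    """
    if not table.isidentifier():
        # Table names are interpolated into DDL, so reject anything unusual.
        raise ValueError(f"unsafe table name: {table!r}")
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY",
        # FORCE applies the policy to the table owner as well.
        f"ALTER TABLE {table} FORCE ROW LEVEL SECURITY",
        # The second argument (missing_ok) makes current_setting return NULL
        # when the variable is unset, so the comparison matches no rows.
        f"CREATE POLICY workspace_isolation ON {table} "
        f"USING (workspace_id::text = current_setting('{setting}', true))",
    ]


for statement in rls_policy_sql("changes"):
    print(statement + ";")
```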

Common workflows

Data flow: a typical certificate scan + alert

Celery beat wakes up on its cadence (default 1-minute tick), reads the next due MonitoredHost rows from Postgres, and enqueues a vigilo.monitor.scan_host task for each.
The FastAPI scanner picks up the task (Celery routes monitor tasks to the FastAPI worker queue), opens a TLS connection, captures the handshake details, and writes a CertificateSnapshot row.
The scanner calls dispatch_event('cert.scanned', ...). The event lands on a Redis pub-sub channel.
The Django AlertRule evaluator (subscribed to that channel) re-evaluates every rule scoped to that host. If a rule transitions from false to true, it writes an Alert row and calls dispatch_event('alert.opened', ...).
The webhook dispatcher (also subscribed) reads the alert, finds matching webhooks for the workspace, and enqueues a vigilo.integrations.deliver_webhook task per match.
The Celery worker fires the HTTP POST, records the delivery in WebhookDelivery, and retries with exponential backoff on failure.
Meanwhile the SPA, connected via WebSocket to the workspace's live channel, gets the same alert.opened event and invalidates the relevant React Query cache — the user's screen updates without a refresh.
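The middle of that flow (scan event, rule evaluation, alert on a false-to-true transition) can be sketched with an in-memory stand-in for the Redis channel. Everything here is illustrative: EventBus, ExpiryRule, the payload fields, and the 14-day threshold are hypothetical, and only the event names cert.scanned and alert.opened come from the steps above.

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for the Redis pub-sub channel."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._subscribers[event_name].append(handler)

    def dispatch_event(self, event_name, payload):
        for handler in list(self._subscribers[event_name]):
            handler(payload)


class ExpiryRule:
    """Hypothetical rule: true when a cert has fewer than threshold_days left."""

    def __init__(self, bus, threshold_days=14):
        self.bus = bus
        self.threshold_days = threshold_days
        self._firing = {}  # (workspace_id, host) -> last evaluated state

    def on_cert_scanned(self, payload):
        key = (payload["workspace_id"], payload["host"])
        now_firing = payload["days_left"] < self.threshold_days
        was_firing = self._firing.get(key, False)
        self._firing[key] = now_firing
        # Open an alert only on the false -> true transition.
        if now_firing and not was_firing:
            self.bus.dispatch_event("alert.opened", dict(payload))


bus = EventBus()
rule = ExpiryRule(bus)
opened = []
bus.subscribe("cert.scanned", rule.on_cert_scanned)
bus.subscribe("alert.opened", opened.append)

for days_left in (30, 10, 9):
    bus.dispatch_event(
        "cert.scanned",
        {"workspace_id": 1, "host": "api.example.com", "days_left": days_left},
    )

print(len(opened))  # 1: only the 30 -> 10 scan crosses the threshold
```

Evaluating on transitions rather than on state is what keeps a certificate that stays below the threshold from opening a new alert on every scan.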

That whole loop, from scan to user notification, runs in 1-3 seconds on a healthy stack.
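The webhook delivery step retries with exponential backoff. The exact schedule Vigilo uses is not stated here, so the base delay, cap, retry count, and jitter in this sketch are assumptions, and backoff_delays is a hypothetical helper rather than part of the codebase.

```python
import random


def backoff_delays(max_retries=5, base=2.0, cap=300.0, jitter=0.0, rng=None):
    """Return the wait (in seconds) before each retry attempt.

    Delays double each attempt and are clamped to cap. A non-zero jitter adds
    up to that fraction of the delay at random, which spreads out retries
    from many deliveries that failed at the same moment.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


print(backoff_delays())          # [2.0, 4.0, 8.0, 16.0, 32.0]
print(backoff_delays(cap=10.0))  # [2.0, 4.0, 8.0, 10.0, 10.0]
```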

How the installer wires it up

scripts/install.sh (or vigilo-ctl install on Windows) provisions a single-VM deployment. It:

Installs Postgres 15, Redis 7, Python 3.11+, Node 20, and Nginx.
Creates the vigilo_django and vigilo_fastapi databases.
Runs migrations, including the RLS policy migration (workspaces/migrations/0002_enable_rls.py).
Sets up four systemd units (Django/Gunicorn, FastAPI/Uvicorn, Celery worker, Celery beat) under infra/systemd/.
Drops an Nginx config (infra/nginx/vigilo.conf) that fronts everything and TLS-terminates.
Optionally provisions Keycloak with a vigilo realm + default client.

In dev, start-dev.ps1 runs the five processes against docker-compose.dev.yml (Postgres, Redis, Keycloak only).

Where each piece runs in prod

Component	Process	Port (internal)	URL exposed by Nginx
Django REST	Gunicorn	8000	`/ws/<slug>/api/v1/*`, `/admin`
FastAPI scanner	Uvicorn	8001	`/ws/<slug>/monitor/*`, `/docs`
Celery worker	celery -A vigilo_celery worker	n/a	(none — internal only)
Celery beat	celery -A vigilo_celery beat	n/a	(none — internal only)
React SPA	Nginx static	(Nginx)	everything else under `/`
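The routing in the table can be read as a small decision function. This sketch mirrors the path prefixes listed above; it is not the actual Nginx configuration in infra/nginx/vigilo.conf, and backend_for and the workspace slug acme are hypothetical.

```python
import re

# Checked in order; the first matching prefix wins.
_ROUTES = [
    (re.compile(r"^/ws/[^/]+/api/"), "django"),
    (re.compile(r"^/ws/[^/]+/monitor/"), "fastapi"),
    (re.compile(r"^/admin(/|$)"), "django"),
    (re.compile(r"^/docs(/|$)"), "fastapi"),
]


def backend_for(path):
    """Return which component serves a request path."""
    for pattern, backend in _ROUTES:
        if pattern.search(path):
            return backend
    # Everything else is a client-side route handled by the React SPA.
    return "spa"


print(backend_for("/ws/acme/api/v1/changes/"))  # django
print(backend_for("/ws/acme/monitor/hosts/7"))  # fastapi
print(backend_for("/ws/acme/changes/42"))       # spa
```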

Permissions & gating

These are architectural choices rather than user permissions, but they are worth knowing:

Workspace isolation is enforced in two layers: WorkspaceScopedMixin filters every ViewSet by the URL slug, and Postgres RLS makes that filter mandatory at the DB level. Both layers exist so that a bug in one doesn't leak data.
Platform admin is the only role that crosses workspace boundaries. The platform-admin surfaces live under /ws/<slug>/platform/... (executive dashboard, cost attribution, plugins) and use a separate PlatformAdminMixin that disables RLS scoping.
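A toy model shows why the second layer matters. FakeDatabase below only imitates what RLS does (filtering every read by the session's workspace); it is not Postgres, and scoped_list, buggy_list, and the sample rows are hypothetical stand-ins for ViewSets with and without the application-level filter.

```python
class FakeDatabase:
    """Imitates RLS: every read is filtered by the session's workspace."""

    def __init__(self, rows):
        self._rows = rows
        self.session_workspace = None  # plays the role of vigilo.workspace_id

    def select_all(self):
        return [r for r in self._rows if r["workspace_id"] == self.session_workspace]


def scoped_list(db, workspace_id):
    """Layer 1 present: the view also filters by workspace itself."""
    return [r for r in db.select_all() if r["workspace_id"] == workspace_id]


def buggy_list(db, workspace_id):
    """Layer 1 missing: the view forgot to scope by workspace."""
    return db.select_all()


db = FakeDatabase([
    {"id": 1, "workspace_id": "ws-a", "title": "Rotate cert"},
    {"id": 2, "workspace_id": "ws-b", "title": "Patch kernel"},
])
db.session_workspace = "ws-a"

print([r["id"] for r in scoped_list(db, "ws-a")])  # [1]
print([r["id"] for r in buggy_list(db, "ws-a")])   # [1]
```

The buggy view returns the same rows as the correct one because the database layer never hands it the other workspace's data in the first place.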

Troubleshooting

"FastAPI says 'database is up' but Django says it's down." — They use different DBs in the same cluster. Check both with pg_isready -d vigilo_django and pg_isready -d vigilo_fastapi.
"Celery tasks just sit in PENDING." — Worker isn't connected to Redis, or it's listening on a different queue. celery -A vigilo_celery inspect active from the worker host.
"Cert scans are spiking Postgres CPU." — Snapshot writes are hitting the same DB as ITSM reads. Scale up the vigilo_fastapi DB independently — they're already separated logically; in prod you can put them on separate clusters.
"WebSocket disconnects every 30 seconds." — Nginx proxy_read_timeout too low. Bump to 120s in the /monitor/ location block.
"RLS error 'permission denied for table changes'." — The session variable vigilo.workspace_id wasn't set. Happens when raw SQL bypasses the Django session middleware. Use the ORM, or set the variable explicitly.