Overview
Vigilo is a five-process application: a Django REST API, a FastAPI scanner, a Celery worker, a Celery beat, and a React SPA. They share a Postgres cluster (two databases: vigilo_django and vigilo_fastapi), a Redis instance (queue + cache + pub-sub), and an OIDC identity provider (Keycloak by default). In production Nginx fronts the HTTP-facing pieces (Django, FastAPI, and the SPA) while the Celery processes stay internal. Everything runs on a single VM for small deployments or splits across hosts for larger ones.
Every workspace is row-level isolated. The "multi-tenant" in "multi-tenant SaaS" is enforced at the database, not the application — even a buggy ViewSet that forgot to scope by workspace can't leak data across boundaries because Postgres won't return the rows.
Why it exists
Most "monitoring + ITSM + compliance" stacks are three different vendors stitched together with brittle webhooks. Each one has its own identity layer, its own audit log, and its own RBAC, and all three want to be your "single pane of glass." Vigilo's whole architectural premise is one codebase, one identity, one audit log. That is achievable precisely because we're a Django monolith with a tightly coupled FastAPI sidecar, not a microservices mesh.
The split between Django and FastAPI exists for one reason: the certificate scanner does a lot of small I/O-bound network calls, and FastAPI's async story is significantly better than Django's. Putting it in its own process keeps a thousand parallel TLS handshakes from clogging the request/response cycle of the ITSM UI.
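To make the async argument concrete, here is a minimal sketch of bounded-concurrency TLS scanning with asyncio. It is not Vigilo's scanner code: the function names, the default limit of 100, and the 5-second timeout are all assumptions chosen for illustration.

```python
import asyncio
import ssl


async def tls_handshake(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Complete a TLS handshake and return the peer certificate as a dict."""
    ctx = ssl.create_default_context()
    reader, writer = await asyncio.wait_for(
        asyncio.open_connection(host, port, ssl=ctx, server_hostname=host),
        timeout=timeout,
    )
    try:
        # "peercert" is populated once the handshake has completed.
        return writer.get_extra_info("peercert")
    finally:
        writer.close()
        await writer.wait_closed()


async def scan_many(hosts, scan_one=tls_handshake, limit: int = 100) -> dict:
    """Scan hosts concurrently with at most `limit` handshakes in flight.

    Returns {host: certificate dict or the exception that host raised}.
    """
    gate = asyncio.Semaphore(limit)

    async def guarded(host):
        async with gate:
            try:
                return host, await scan_one(host)
            except Exception as exc:  # one bad host must not sink the batch
                return host, exc

    pairs = await asyncio.gather(*(guarded(h) for h in hosts))
    return dict(pairs)
```

Each handshake spends most of its time waiting on the network, so a single event loop can keep many of them in flight without tying up a request thread per host. The semaphore is only there to cap open sockets.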
Key concepts
- Django (port 8000 / 9101 dev): Owns 95% of the data model and almost all writes. Serves the JSON API at /ws/<slug>/api/v1/... and authenticates via OIDC tokens validated against Keycloak. Every model carries workspace_id; every ViewSet extends WorkspaceScopedMixin.
- FastAPI (port 8001 / 9102 dev): The certificate scanner + monitoring scrapers. Owns the vigilo_fastapi database (snapshots, scan results, latency timeseries). Talks to the Django DB only to read the list of MonitoredHost rows it should be scanning. Exposes a small read API at /ws/<slug>/monitor/... for the SPA's certificate detail panels.
- Celery worker: Background jobs. Cert renewal kicks, webhook dispatch, playbook execution, SCIM provisioning, periodic compliance evaluations. Every task takes workspace_id as its first argument so the audit row is correctly scoped.
- Celery beat: Scheduler. Triggers cert scans on their cadence, sweeps expiring tokens, reconciles asset inventory, fires "stale change" reminders.
- React SPA (port 9100 dev): Vite-built single-page app served by Nginx in prod. All routes under /ws/<slug>/... are SPA routes; only /ws/<slug>/api/... and /ws/<slug>/monitor/... hit a backend.
- Postgres: Two logical DBs in one cluster. Row-level security policies on every workspace-scoped table; the application sets the vigilo.workspace_id session variable on every connection, and the policies read it through current_setting('vigilo.workspace_id').
- Redis: Three uses: Celery queue, app-level cache (per-workspace TTLs), and the WebSocket pub-sub channel that lets Django push live events to the React UI.
- Keycloak: Default OIDC provider. Any OIDC-compliant IdP works (Okta, Auth0, Azure AD). Vigilo doesn't store passwords.
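The "workspace_id as first argument" convention for Celery tasks can be sketched as a plain decorator. This is an illustration only: workspace_task, AUDIT_LOG, and the integer ID type are hypothetical, and in a real codebase a wrapper like this would presumably sit beneath Celery's own task decorator rather than replace it.

```python
import functools

# Stand-in for an audit table (hypothetical; the real store is not shown in this doc).
AUDIT_LOG: list[dict] = []


def workspace_task(func):
    """Require workspace_id as the first argument and record a scoped audit entry."""

    @functools.wraps(func)
    def wrapper(workspace_id, *args, **kwargs):
        # Assumption: workspace IDs are integers. Adjust the check for UUIDs.
        if not isinstance(workspace_id, int):
            raise TypeError(
                f"{func.__name__} expects workspace_id first, got {workspace_id!r}"
            )
        AUDIT_LOG.append({"workspace_id": workspace_id, "task": func.__name__})
        return func(workspace_id, *args, **kwargs)

    return wrapper


@workspace_task
def deliver_webhook(workspace_id, webhook_id):
    """Toy task body; the real task would perform the HTTP POST."""
    return f"ws={workspace_id} webhook={webhook_id}"
```

Putting the ID in a fixed position means the audit entry can be written before the task body runs, so even a task that fails early leaves a correctly scoped record.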
Common workflows
Data flow: a typical certificate scan + alert
- Celery beat wakes up on its cadence (default 1-minute tick), reads the next due MonitoredHost rows from Postgres, and enqueues a vigilo.monitor.scan_host task for each.
- The FastAPI scanner picks up the task (Celery routes monitor tasks to the FastAPI worker queue), opens a TLS connection, captures the handshake details, and writes a CertificateSnapshot row.
- The scanner calls dispatch_event('cert.scanned', ...). The event lands on a Redis pub-sub channel.
- The Django AlertRule evaluator (subscribed to that channel) re-evaluates every rule scoped to that host. If a rule transitions from false to true, it writes an Alert row and calls dispatch_event('alert.opened', ...).
- The webhook dispatcher (also subscribed) reads the alert, finds matching webhooks for the workspace, and enqueues a vigilo.integrations.deliver_webhook task per match.
- The Celery worker fires the HTTP POST, records the delivery in WebhookDelivery, and retries with exponential backoff on failure.
- Meanwhile the SPA, connected via WebSocket to the workspace's live channel, gets the same alert.opened event and invalidates the relevant React Query cache, so the user's screen updates without a refresh.
That whole loop, from scan to user notification, runs in 1-3 seconds on a healthy stack.
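Two steps in that loop are easy to get subtly wrong: alerting only on the false-to-true transition, and spacing out webhook retries. The sketch below illustrates both in isolation. RuleTracker, backoff_delays, and every default value (1-second base, factor 2, 300-second cap) are assumptions for illustration, not Vigilo's actual evaluator or retry settings.

```python
class RuleTracker:
    """Remember each rule's last result so an alert opens only on false -> true."""

    def __init__(self) -> None:
        self._last: dict[str, bool] = {}

    def observe(self, rule_id: str, matches: bool) -> bool:
        """Record the latest evaluation; return True if an alert should open."""
        previous = self._last.get(rule_id, False)
        self._last[rule_id] = matches
        return matches and not previous


def backoff_delays(attempts: int, base: float = 1.0,
                   factor: float = 2.0, cap: float = 300.0) -> list[float]:
    """Delay in seconds before each retry: base * factor**n, capped at `cap`."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


tracker = RuleTracker()
# A rule that stays true across consecutive scans opens only one alert.
opened = [tracker.observe("cert-expiring", m) for m in (False, True, True, False, True)]
print(opened)             # [False, True, False, False, True]
print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Tracking the previous result is what keeps a persistently failing host from opening a new alert on every one-minute scan.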
How the installer wires it up
scripts/install.sh (or vigilo-ctl install on Windows) provisions a single-VM deployment. It:
- Installs Postgres 15, Redis 7, Python 3.11+, Node 20, and Nginx.
- Creates the vigilo_django and vigilo_fastapi databases.
- Runs migrations, including the RLS policy migration (workspaces/migrations/0002_enable_rls.py).
- Sets up four systemd units (Django/Gunicorn, FastAPI/Uvicorn, Celery worker, Celery beat) under infra/systemd/.
- Drops an Nginx config (infra/nginx/vigilo.conf) that fronts everything and TLS-terminates.
- Optionally provisions Keycloak with a vigilo realm + default client.
In dev, start-dev.ps1 runs the five processes against docker-compose.dev.yml (Postgres, Redis, Keycloak only).
Where each piece runs in prod
| Component | Process | Port (internal) | URL exposed by Nginx |
|---|---|---|---|
| Django REST | Gunicorn | 8000 | /ws/<slug>/api/v1/*, /admin |
| FastAPI scanner | Uvicorn | 8001 | /ws/<slug>/monitor/*, /docs |
| Celery worker | celery -A vigilo_celery worker | n/a | (none — internal only) |
| Celery beat | celery -A vigilo_celery beat | n/a | (none — internal only) |
| React SPA | Nginx static | (Nginx) | everything else under / |
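The routing in the table can be restated as a small path classifier, which is handy for sanity-checking which process should answer a given URL. This is a Python illustration of the rules above, not the contents of infra/nginx/vigilo.conf. The upstream labels and the regexes are assumptions.

```python
import re

# Workspace prefix shared by the API and monitor routes.
_WS = r"^/ws/[^/]+"

# Ordered (pattern, upstream) pairs mirroring the table; first match wins.
ROUTES = [
    (re.compile(_WS + r"/api/v1/"), "django"),
    (re.compile(_WS + r"/monitor/"), "fastapi"),
    (re.compile(r"^/admin(/|$)"), "django"),
    (re.compile(r"^/docs(/|$)"), "fastapi"),
]


def upstream_for(path: str) -> str:
    """Return which component serves `path`; anything unmatched is the SPA."""
    for pattern, upstream in ROUTES:
        if pattern.match(path):
            return upstream
    return "spa"


print(upstream_for("/ws/acme/api/v1/changes/"))  # django
print(upstream_for("/ws/acme/monitor/certs/1"))  # fastapi
print(upstream_for("/ws/acme/changes"))          # spa
```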
Permissions & gating
Architectural choices, not user permissions, but worth knowing:
- Workspace isolation is enforced in two layers: WorkspaceScopedMixin filters every ViewSet by the URL slug, and Postgres RLS makes that filter mandatory at the DB level. Both layers exist so a bug in one doesn't leak data.
- Platform admin is the only role that crosses workspace boundaries. The platform-admin surfaces live under /ws/<slug>/platform/... (executive dashboard, cost attribution, plugins) and use a separate PlatformAdminMixin that disables RLS scoping.
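For the database layer, a minimal sketch of what an RLS policy and the per-connection session variable could look like is shown below. The actual migration (workspaces/migrations/0002_enable_rls.py) is not reproduced in this document, so the policy name, the bigint cast, and both helper functions are assumptions. The only Postgres features used are CREATE POLICY, current_setting, and set_config.

```python
import re

# Assumed policy shape: rows are visible only when workspace_id matches the
# session variable. The ::bigint cast assumes integer workspace IDs.
POLICY_TEMPLATE = """\
ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;
CREATE POLICY workspace_isolation ON {table}
    USING (workspace_id = current_setting('vigilo.workspace_id')::bigint);
"""

_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def policy_sql(table: str) -> str:
    """Render the policy DDL for one table, rejecting unsafe identifiers."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unsafe table name: {table!r}")
    return POLICY_TEMPLATE.format(table=table)


def set_workspace(cursor, workspace_id: int) -> None:
    """Set the session variable that the policies read.

    `cursor` is any DB-API cursor using the %s paramstyle (psycopg, for
    example). The third argument to set_config is is_local: false keeps the
    value for the whole session rather than just the current transaction.
    """
    cursor.execute(
        "SELECT set_config(%s, %s, false)",
        ("vigilo.workspace_id", str(workspace_id)),
    )
```

Raw SQL paths would call set_workspace(cursor, workspace_id) before their first query. One caveat worth checking in any setup like this: Postgres exempts table owners and roles with BYPASSRLS from policies unless FORCE ROW LEVEL SECURITY is set on the table, so the application should connect as a role that the policies actually apply to.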
Troubleshooting
- "FastAPI says 'database is up' but Django says it's down." They use different DBs in the same cluster. Check both with pg_isready -d vigilo_django and pg_isready -d vigilo_fastapi.
- "Celery tasks just sit in PENDING." The worker isn't connected to Redis, or it's listening on a different queue. Run celery -A vigilo_celery inspect active from the worker host.
- "Cert scans are spiking Postgres CPU." Snapshot writes are hitting the same cluster as ITSM reads. The two databases are already separated logically, so in prod you can move vigilo_fastapi to its own cluster and scale it independently.
- "WebSocket disconnects every 30 seconds." Nginx proxy_read_timeout is too low. Bump it to 120s in the /monitor/ location block.
- "RLS error 'permission denied for table changes'." The session variable vigilo.workspace_id wasn't set. This happens when raw SQL bypasses the Django session middleware. Use the ORM, or set the variable explicitly.
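The two pg_isready checks from the first item can be wrapped in a small script. This is a convenience sketch: the host and port defaults are assumptions, and the helper names are hypothetical. The exit-code meanings are the ones pg_isready documents.

```python
import subprocess

DATABASES = ("vigilo_django", "vigilo_fastapi")

# pg_isready exit codes and their documented meanings.
STATUS = {
    0: "accepting connections",
    1: "rejecting connections",
    2: "no response",
    3: "no attempt made",
}


def pg_isready_cmd(dbname: str, host: str = "localhost", port: int = 5432) -> list[str]:
    """Build the pg_isready argument list for one database."""
    return ["pg_isready", "-h", host, "-p", str(port), "-d", dbname]


def check_databases(run=subprocess.run) -> dict[str, str]:
    """Run pg_isready for each database and map exit codes to readable text.

    `run` is injectable so the function can be exercised without Postgres.
    """
    report = {}
    for db in DATABASES:
        result = run(pg_isready_cmd(db), capture_output=True, text=True)
        report[db] = STATUS.get(
            result.returncode, f"unknown exit code {result.returncode}"
        )
    return report


if __name__ == "__main__":
    for name, state in check_databases().items():
        print(f"{name}: {state}")
```

Keep in mind that pg_isready reports server status and does not need a valid database name to do so. If both checks pass but one service still fails, the next step is an actual connection with that service's credentials.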