Self-hosting the control plane#

See also: Self-host quickstart (docker compose) — one-command CP + console stack for local eval. This page is the production guide; that page is the fastest way to see it working. Both are linked from the main quickstart.

You have two ways to run Airlock's control plane: let us host it (cp.airlocklabs.ai), or run the same image yourself, next to your worker. Both produce the same product surface — same tunnel handshake, same MCP edge, same audit shape — and both keep your worker (and therefore your DB credentials and raw rows) inside your VPC. The only thing that changes is who runs the CP service.

This page is the fork: when to pick which, and how to actually deploy it if you pick self-hosted.

Hosted vs. self-hosted#

For most readers, hosted is the right pick. Self-host when a regulatory or contractual requirement says routing metadata can't leave your cloud account, or when you need a single-tenant deploy you can audit and freeze yourself.

| | Hosted CP (cp.airlocklabs.ai) | Self-hosted CP (your VPC / your cloud account) |
|---|---|---|
| Who runs it | Airlock | You |
| What sees customer rows | Worker only, never CP | Same |
| What CP sees | Routing metadata + audit trail + snapshot meta | Same |
| Onboarding | Operator token in your inbox, ~5 min | Provision an image, deploy, point DNS, ~1–2 hours |
| Upgrade cadence | Automatic (we ship security patches) | You pull new image versions on your schedule |
| Compliance fit | OK for most | Required for data-residency, FedRAMP, single-tenant audit guarantees |
| Cost (your side) | Free during private beta | Compute + storage + ops time |

The data-boundary story is identical either way. The worker stays in your VPC, holds the only DATABASE_URL, exports masked DuckDB snapshots locally, and dials out to CP over WSS. CP never opens an inbound connection to your network and never sees raw rows. See Architecture and Concepts → Self-hosted vs hosted for the full picture.

What self-hosting actually buys you:

  • Routing metadata stays in your account. Tool name, snapshot id, latency, outcome — the audit log shape from Security § 7 — sits on a volume you control rather than ours.
  • Single-tenant runtime. No co-tenancy with other customers' metadata.
  • Frozen versions. You decide when to pull a new image.

What it does not buy you:

  • Different exposure of customer rows. Both modes route through the same worker; the worker is what protects rows, not the CP.
  • A different cryptographic boundary today. Mode A (the default) has the CP relaying SQL + masked results in process memory; that's true whether the CP is on Fly under our account or yours. Mode B (end-to-end encrypted, CP becomes payload-opaque) is on the roadmap and applies equally to both deploy modes — see Concepts → Mode A vs Mode B.

If those tradeoffs don't match a constraint you actually have, take hosted and keep moving.

Prerequisites for self-hosting#

  • Container image. The airlockai/control-pane repo is private during the beta; we ship a signed image artifact as part of onboarding. Email hello@airlocklabs.ai to get registry access. There is no public DIY path today.
  • A reachable HTTPS URL for AIRLOCK_CP_PUBLIC_URL. Workers dial it via WSS (wss://…/v1/tunnel); agents call <public_url>/mcp/<tenant_slug> over HTTPS. The CP refuses to start without it — see app.py::_require_public_url.
  • TLS termination. Fly + Cloudflare DNS-only handle this for free; for bare VM use Caddy or nginx in front. Cloudflare's orange-cloud proxy will interfere with the WSS tunnel — keep DNS records grey-cloud.
  • Persistent volume, ~3 GB minimum. Holds config.yaml, state.yaml, audit.jsonl, the CP's Ed25519 keypair (cp_ed25519.pem), and the snapshot meta sidecars under snapshots/. Loss = re-onboard every worker.
  • Egress allow-list. CP makes outbound calls only to operator OIDC providers (when wired) and your log shipping target. No outbound DB calls — that's the worker's job. CP has no DATABASE_URL field and rejects one in airlock.yaml if a worker ships it.
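To make the public-URL prerequisite concrete, here is a minimal Python sketch of a preflight check plus the two endpoints that hang off the URL. The function names (validate_public_url, derive_endpoints) and the tenant slug "acme" are hypothetical; this is not the CP's actual _require_public_url code. The only details taken from this page are the /v1/tunnel and /mcp/<tenant_slug> paths and the rule that localhost URLs are rejected.

```python
from urllib.parse import urlsplit

# Hosts this page says the CP rejects for AIRLOCK_CP_PUBLIC_URL.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def validate_public_url(public_url):
    """Return the URL without a trailing slash, or raise ValueError."""
    parts = urlsplit(public_url)
    if parts.scheme != "https":
        raise ValueError("AIRLOCK_CP_PUBLIC_URL must be an https:// URL")
    if not parts.hostname or parts.hostname in LOCAL_HOSTS:
        raise ValueError("AIRLOCK_CP_PUBLIC_URL must not point at localhost")
    return public_url.rstrip("/")


def derive_endpoints(public_url, tenant_slug):
    """Build the worker tunnel URL and the agent MCP URL from the public URL."""
    base = validate_public_url(public_url)
    return {
        # Workers dial the same host over WSS.
        "tunnel": "wss://" + base[len("https://"):] + "/v1/tunnel",
        # Agents call the per-tenant MCP edge over HTTPS.
        "mcp": f"{base}/mcp/{tenant_slug}",
    }


print(derive_endpoints("https://cp.example.com", "acme"))
```

Running a check like this in CI before deploy can catch a mistyped URL earlier than a failed CP boot would.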

Reference deploy: Fly.io#

Fly is what we use ourselves. One machine, one volume. Five commands from a clean checkout to a running CP:

flyctl launch --no-deploy --copy-config --name airlock-cp --region iad
flyctl volumes create airlock_state --size 3 --region iad --yes
flyctl secrets set AIRLOCK_CP_OPERATOR_TOKEN="$(openssl rand -hex 32)"
flyctl secrets set AIRLOCK_CP_PUBLIC_URL="https://cp.example.com"
flyctl deploy

Line by line:

  1. launch --no-deploy — creates the Fly app from the committed fly.toml but doesn't boot the machine yet. We need the volume + secrets in place first.
  2. volumes create airlock_state --size 3 — backs the /state mount. 3 GB is plenty for the audit log; resize live later with flyctl volumes extend.
  3. AIRLOCK_CP_OPERATOR_TOKEN is the master credential for /v1/admin/*. Treat it like a root password. Rotate by setting a new value with flyctl secrets set, which restarts the machine; the old token stops working once the restart completes.
  4. AIRLOCK_CP_PUBLIC_URL — the HTTPS URL the CP advertises to workers. Without it CP refuses to start.
  5. flyctl deploy — first boot seeds config.yaml from config.example.yaml, generates cp_ed25519.pem, opens the tunnel listener.

The minimal fly.toml we ship looks like this:

app = "airlock-cp"
primary_region = "iad"

[build]
  dockerfile = "Dockerfile"

[env]
  AIRLOCK_CP_CONFIG = "/state/config.yaml"
  AIRLOCK_CP_STATE_PATH = "/state/state.yaml"
  AIRLOCK_CP_SNAPSHOT_DIR = "/state/snapshots"
  AIRLOCK_CP_LOG_LEVEL = "INFO"

[mounts]
  source = "airlock_state"
  destination = "/state"
  initial_size = "3gb"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "off"      # long-lived WS tunnels from workers
  min_machines_running = 1

  [[http_service.checks]]
    interval = "30s"
    method = "GET"
    path = "/health"
    timeout = "5s"

[[vm]]
  memory = "512mb"
  cpu_kind = "shared"
  cpus = 1

Two things about this shape worth flagging:

  • auto_stop_machines = "off". Workers hold a long-lived WSS tunnel; a stopped machine drops every connected worker.
  • One replica. Runtime state lives on a single Fly volume rather than Postgres. Fine for pilot; multi-region requires migrating state to Postgres first. See Security § 12.

After deploy, point a custom domain at the Fly app:

flyctl certs create cp.example.com
flyctl certs show cp.example.com   # follow the CNAME / TXT instructions

Set the Cloudflare CNAME for cp to airlock-cp.fly.dev with DNS-only (grey cloud) — orange-cloud will break the WS handshake.

Verify:

curl https://cp.example.com/health     # → ok
curl -H "Authorization: Bearer $AIRLOCK_CP_OPERATOR_TOKEN" \
  https://cp.example.com/v1/admin/tenants     # → {"tenants": []}
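If you would rather script the two checks above than run curl by hand, the following Python sketch does the same thing with the standard library. The helper names (http_get, verify_cp) are hypothetical, and the only response detail assumed is the "tenants" key shown in the curl output above.

```python
import json
import urllib.error
import urllib.request


def http_get(url, token=None):
    """GET a URL, optionally with a bearer token; return (status, body)."""
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; report the status instead of crashing.
        return err.code, err.read().decode("utf-8", errors="replace")


def verify_cp(base_url, token, get=http_get):
    """Run both checks; return a list of problems (empty means healthy)."""
    problems = []
    status, _ = get(f"{base_url}/health")
    if status != 200:
        problems.append(f"/health returned {status}")
    status, body = get(f"{base_url}/v1/admin/tenants", token)
    if status != 200:
        problems.append(f"/v1/admin/tenants returned {status}")
    elif "tenants" not in json.loads(body):
        problems.append("admin response has no 'tenants' key")
    return problems
```

Call it as verify_cp("https://cp.example.com", token) with the operator token read from your secret store. The get parameter exists only so the function can be exercised without a network.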

Other deploy targets#

The image ships a single Python/Starlette process listening on :8080. Anywhere you can run a container with a persistent volume will work; the shapes below are sketches, not certified runbooks.

Docker Compose#

Fine for a single-VM deploy or a homelab pilot. TLS terminates in front (Caddy / nginx).

services:
  cp:
    image: <your-registry>/airlockai/control-pane:<tag>
    restart: unless-stopped
    ports:
      - "8080:8080"
    environment:
      AIRLOCK_CP_OPERATOR_TOKEN: "${AIRLOCK_CP_OPERATOR_TOKEN}"
      AIRLOCK_CP_PUBLIC_URL: "https://cp.example.com"
      AIRLOCK_CP_CONFIG: "/state/config.yaml"
      AIRLOCK_CP_STATE_PATH: "/state/state.yaml"
      AIRLOCK_CP_SNAPSHOT_DIR: "/state/snapshots"
    volumes:
      - airlock_state:/state

volumes:
  airlock_state:

Kubernetes#

A Deployment (replicas: 1), a PVC for /state, a Service on 8080, and an Ingress doing TLS termination is everything you need. The manifest is straightforward — no operator, no CRDs. We don't ship a Helm chart yet because we'd rather not maintain a half-baked one; ping hello@airlocklabs.ai if you want a starter chart for your cluster shape and we'll send what we have.

A few constraints worth knowing if you're writing the manifest yourself:

  • replicas: 1 until state moves to Postgres. The volume is single-writer.
  • Don't put a Cloudflare-style L7 proxy in front. Nginx / Traefik / Istio with WS support is fine. Cloudflare's orange-cloud is not.
  • /health is GET 200 OK. Use it for both readiness and liveness.

Bare VM with systemd#

If you already have a hardened VM with TLS in front (Caddy, nginx, or an ALB), the smallest deploy is a single docker run behind a systemd unit:

docker run -d --name airlock-cp \
  -p 127.0.0.1:8080:8080 \
  -e AIRLOCK_CP_OPERATOR_TOKEN="$(cat /etc/airlock/op_token)" \
  -e AIRLOCK_CP_PUBLIC_URL="https://cp.example.com" \
  -v /var/lib/airlock/state:/state \
  --restart unless-stopped \
  <your-registry>/airlockai/control-pane:<tag>

Caddy in front of it gets you free TLS:

cp.example.com {
  reverse_proxy 127.0.0.1:8080
}

AWS ECS / GCP Cloud Run#

Both work as long as the runtime supports a persistent volume. Cloud Run gen2 with a Cloud Storage FUSE mount or a regional Filestore share is the minimum for /state; ECS with EFS works fine. The ephemeral-disk default won't — state.yaml and the audit log need to survive a restart.

Required configuration#

These are the env vars CP actually reads. Anything not on this list is either a worker var or doesn't exist; don't invent them.

| Env var | Required | What it does |
|---|---|---|
| AIRLOCK_CP_OPERATOR_TOKEN | yes | Bearer for /v1/admin/* (god-mode). Console "operator-token" login also resolves to this until OIDC ships. Constant-time compared with hmac.compare_digest. |
| AIRLOCK_CP_PUBLIC_URL | yes | Public URL CP advertises to workers (https://cp.example.com). CP refuses to start without it. Workers dial wss://…/v1/tunnel here; agents POST <url>/mcp/<tenant_slug>. |
| AIRLOCK_SESSION_SECRET | recommended | HS256 secret for verifying console-issued session JWTs (Auth.js v5). 32-byte random, shared with the console build. Optional today; without it, only operator-token auth works. |
| AIRLOCK_CP_CONFIG | no, default /state/config.yaml | Path to config.yaml. The image's entrypoint seeds it from config.example.yaml on first boot. |
| AIRLOCK_CP_STATE_PATH | no, default <audit-dir>/state.yaml | Path to runtime state (tenants, API-key hashes, worker pubkeys). |
| AIRLOCK_CP_SNAPSHOT_DIR | no, default /state/snapshots | Directory for the per-tenant snapshot meta sidecars. |
| AIRLOCK_CP_LOG_LEVEL | no, default INFO | Standard Python log level. |
| AIRLOCK_CP_ALLOWED_ORIGINS | no | CORS allow-list (comma-separated). Production: your console origin. |
| AIRLOCK_CP_METRICS_PORT | no, default 9090 | Prometheus listener; set 0 to disable. |
| AIRLOCK_CP_MAX_SESSIONS_PER_TENANT | no, default 8 | Hard cap on connected workers per tenant, so a misconfigured replicas: 10000 won't OOM CP. |
| AIRLOCK_DEMO_TENANT_REGISTER | no, default true in the image | Registers the three bundled hosted-demo tenants (t_fintech, t_healthcare, t_hr) on startup so the playground is browseable out of the box. Set false for a clean self-hosted deploy where the only tenants are the ones you create. Legacy alias: AIRLOCK_DEMO_TENANT_ENABLED=false. |
| AIRLOCK_CP_ALLOW_LOCALHOST_PUBLIC_URL | no, default false | Local dev only. Lets AIRLOCK_CP_PUBLIC_URL be unset / localhost. Never set in production: workers will receive ws://localhost:18080 enrollment payloads. |

The cp_private_key_path (default /state/cp_ed25519.pem) and audit_log_path (default /state/audit.jsonl) are read from config.yaml's server: block, not env. The defaults match the fly.toml mount; you usually don't need to touch them.
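The table above notes that the operator token is compared in constant time with hmac.compare_digest. As an illustration of what that kind of check looks like, here is a minimal sketch; the function name bearer_matches and the header parsing are assumptions, not the CP's actual handler.

```python
import hmac


def bearer_matches(authorization_header, operator_token):
    """Check an 'Authorization: Bearer <token>' header value against the token."""
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not operator_token:
        return False
    # compare_digest is designed so its running time does not depend on
    # where the two inputs first differ, unlike a plain == comparison.
    return hmac.compare_digest(
        presented.encode("utf-8"), operator_token.encode("utf-8")
    )
```

A plain string comparison can return early at the first mismatching character, which in principle lets an attacker recover a token one character at a time by timing responses.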

Operations#

Backups#

Snapshot the /state volume nightly. The state dir contains the audit log and tenant registry; losing it means re-onboarding every worker. Fly takes daily volume snapshots by default. The default snapshot retention is fine for a pilot; raise it to 30+ days if you need enterprise-grade retention.
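On platforms without built-in volume snapshots (a bare VM, for example), a nightly file-level archive is one possible fallback. The sketch below is an assumption about how you might do that, not a supported backup tool; archive_state and the archive naming are hypothetical.

```python
import tarfile
import time
from pathlib import Path


def archive_state(state_dir, backup_dir):
    """Write a timestamped .tar.gz of state_dir into backup_dir; return its path."""
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    archive = target_dir / f"airlock-state-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps paths inside the archive relative ("state/...").
        tar.add(str(state_dir), arcname="state")
    return archive
```

Run it from cron or a systemd timer, for example archive_state("/var/lib/airlock/state", "/var/backups/airlock"). Because audit.jsonl is appended to while CP runs, an archive taken from a live volume may end with a partially written line; a platform-level snapshot avoids that.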

Upgrades#

Deploy a newer image tag. On Fly, flyctl deploy does a rolling single-machine restart; tunnel sessions auto-reconnect within ~5 seconds. State is forward-compatible within a major version. We post breaking changes in release notes; until 1.0 we'll email self-hosting customers individually.

Rollback#

flyctl releases -a airlock-cp --image          # list releases with their image references
flyctl deploy -a airlock-cp --image <previous-image-ref>

State migrations are forward-only on major-version bumps; rolling back a major won't work without restoring a volume snapshot. Within a minor, revert is safe.

Log shipping#

CP writes structured JSON to stdout. Pipe it to whatever log aggregator you already run — Datadog, Splunk, CloudWatch, an OTel collector. The audit log is separate (/state/audit.jsonl, append-only) and is what feeds the console's /audit page over SSE; ship it to your SIEM via the volume, not via stdout.
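Shipping the audit log "via the volume" usually means tailing the file from a sidecar or agent. As a rough sketch of the idea, the function below reads only complete lines appended since the last call and returns the new byte offset to persist between runs. The name read_new_records is hypothetical, and the sketch assumes one JSON object per line, which is what the .jsonl extension suggests.

```python
import json


def read_new_records(path, offset=0):
    """Return (records, new_offset) for complete JSON lines appended after offset."""
    records = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        while True:
            line = handle.readline()
            # Stop at EOF, or at a partial line that CP is still writing.
            if not line.endswith(b"\n"):
                break
            offset += len(line)
            if line.strip():
                records.append(json.loads(line))
    return records, offset
```

Store the returned offset somewhere durable and pass it back in on the next poll, so records are forwarded to the SIEM once and a half-written trailing line is picked up on a later pass.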

Monitoring#

/health returns 200 OK when the process is up; point your uptime monitor at it. /v1/tenants is a richer, unauthenticated status endpoint that returns each tenant's connected-session count, which makes it useful as a "workers are still dialed in" probe.

The :9090/metrics Prometheus endpoint exposes airlock_cp_tool_calls_total{tool,outcome}, airlock_cp_tool_latency_seconds{tool}, airlock_cp_active_sessions{tenant_id}, and airlock_cp_hello_total. Set AIRLOCK_CP_METRICS_PORT=0 to disable.
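If you don't run Prometheus yet, the gauge can still be checked from a script. The sketch below parses the airlock_cp_active_sessions lines out of a scraped /metrics body and lists tenants with no connected workers. It assumes the plain Prometheus text format with tenant_id as the only label and no trailing timestamp; the function names are hypothetical.

```python
import re

# Matches lines such as: airlock_cp_active_sessions{tenant_id="t_fintech"} 2
_SESSION_LINE = re.compile(
    r'^airlock_cp_active_sessions\{tenant_id="([^"]*)"\}\s+([0-9.eE+-]+)\s*$'
)


def active_sessions(metrics_text):
    """Map tenant_id to active session count from a /metrics response body."""
    counts = {}
    for line in metrics_text.splitlines():
        match = _SESSION_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = float(match.group(2))
    return counts


def tenants_without_workers(metrics_text):
    """Return tenant ids whose gauge reports zero connected sessions."""
    return sorted(t for t, n in active_sessions(metrics_text).items() if n == 0)
```

In a real deployment a Prometheus alert on the same gauge is the more conventional route; this is only meant to show what the metric carries.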

Operator token rotation#

The operator token is the master credential. Rotate every ~90 days and on operator turnover.

NEW_TOKEN="$(openssl rand -hex 32)"
flyctl secrets set AIRLOCK_CP_OPERATOR_TOKEN="$NEW_TOKEN"   # restarts the machine

Then update any place you pasted the old token: the console login form, scripted curl sessions, Terraform cp_operator_token variables, CI secrets. The previous token stops working on the next request after the machine restart.
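For teams that drive rotation from a script rather than a shell, here is a small Python sketch of the two pieces involved: generating a token equivalent to openssl rand -hex 32, and deciding whether the ~90-day window has elapsed. The function names are hypothetical, and how you record the last rotation date is left to you.

```python
import secrets
from datetime import date, timedelta


def new_operator_token():
    """Return 64 hex characters from 32 random bytes, like `openssl rand -hex 32`."""
    return secrets.token_hex(32)


def rotation_due(last_rotated, today, max_age_days=90):
    """True once the token is at least max_age_days old."""
    return today - last_rotated >= timedelta(days=max_age_days)
```

For example, rotation_due(date(2024, 1, 1), date.today()) can gate a scheduled job that then sets the new secret with the flyctl command above.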

OIDC sign-in (Google / Okta) is on the roadmap and will drop the shared-secret model — the operator token survives as a break-glass only. See Security § 6 — Operator-side for the full auth model.

Security notes#

  • The operator token is root-equivalent today. It bypasses every org-scope check on /v1/admin/*. Console session cookies derive from it. Treat it like a database password: restrict to a tight list of operators, rotate on turnover, never paste into chat.
  • Volume contents are sensitive. audit.jsonl is append-only but readable; tenant Ed25519 public keys and API-key Argon2id hashes live in state.yaml. If your platform doesn't encrypt volumes at rest, do it at the filesystem layer.
  • CP stores no customer rows. It sees routing metadata and the routed envelope; in Mode A that envelope (SQL + masked results) passes through process memory, but SQL and row content are never written to disk. The worker is the only thing with database credentials. A CP compromise can't exfiltrate snapshots because CP doesn't have them. See Security § 4 for the full data inventory.

Troubleshooting#

  • CP refuses to start, error mentions AIRLOCK_CP_PUBLIC_URL. Either unset or contains localhost / 127.0.0.1. Set it to a real HTTPS URL or, for local dev only, set AIRLOCK_CP_ALLOW_LOCALHOST_PUBLIC_URL=true.
  • Worker can't connect. From the worker host: curl -v $AIRLOCK_CP_PUBLIC_URL/health. If that fails, the worker can't reach CP — check egress, DNS, and TLS chain. The WSS upgrade rides the same connection; HTTPS reachability is a strict prerequisite.
  • Console login fails with the operator token. The value you're typing doesn't match the deployed secret. flyctl secrets list shows only a digest, not the value, so you can't read it back; rotate with flyctl secrets set and try again with the new value.
  • HELLO_NACK reason=tenant_session_cap_reached. A worker Deployment scaled past AIRLOCK_CP_MAX_SESSIONS_PER_TENANT (default 8). Either bump the cap or scale workers down.
  • /state out of disk. audit.jsonl grows monotonically (there is no rotation today; see Security § 12). On Fly, extend the volume by ID: flyctl volumes extend <volume-id> --size 10 (flyctl volumes list shows the ID). Per-tenant retention + S3 archival is in-flight.
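Since the audit log has no rotation today, it can help to watch the volume before it fills. The sketch below reports how full the state volume is and how large audit.jsonl has grown. The function name, the 80% warning threshold, and the report shape are all assumptions for illustration.

```python
import os
import shutil


def state_disk_report(state_dir, warn_fraction=0.8):
    """Summarise disk usage for the state volume and the audit log."""
    usage = shutil.disk_usage(state_dir)
    audit_path = os.path.join(state_dir, "audit.jsonl")
    audit_bytes = os.path.getsize(audit_path) if os.path.exists(audit_path) else 0
    used_fraction = usage.used / usage.total
    return {
        "used_fraction": round(used_fraction, 3),
        "audit_bytes": audit_bytes,
        # True once the filesystem holding state_dir crosses the threshold.
        "warn": used_fraction >= warn_fraction,
    }
```

Run it against /state (or wherever the volume is mounted) from a cron job and alert when "warn" is true, so the volume can be extended before CP runs out of space.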

Next steps#