Self-hosting the control plane#
See also: Self-host quickstart (docker compose) — one-command CP + console stack for local eval. This page is the production guide; that page is the fastest way to see it working. Both are linked from the main quickstart.
You have two ways to run Airlock's control plane: let us host it
(cp.airlocklabs.ai), or run the same image yourself, next to your
worker. Both produce the same product surface — same tunnel
handshake, same MCP edge, same audit shape — and both keep your
worker (and therefore your DB credentials and raw rows) inside
your VPC. The only thing that changes is who runs the CP service.
This page is the fork: when to pick which, and how to actually deploy it if you pick self-hosted.
Hosted vs. self-hosted#
For most readers, hosted is the right pick. Self-host when you have a regulatory or contractual reason that says routing metadata can't leave your cloud account, OR when you need a single-tenant deploy you can audit and freeze yourself.
| | Hosted CP (cp.airlocklabs.ai) | Self-hosted CP (your VPC / your cloud account) |
|---|---|---|
| Who runs it | Airlock | You |
| What sees customer rows | Worker only — never CP | Same |
| What CP sees | Routing metadata + audit trail + snapshot meta | Same |
| Onboarding | Operator token in your inbox, ~5 min | Provision an image, deploy, point DNS, ~1–2 hours |
| Upgrade cadence | Automatic (we ship security patches) | You pull new image versions on your schedule |
| Compliance fit | OK for most | Required for data-residency, FedRAMP, single-tenant audit guarantees |
| Cost (your side) | Free during private beta | Compute + storage + ops time |
The data-boundary story is identical either way. The worker stays
in your VPC, holds the only DATABASE_URL, exports masked DuckDB
snapshots locally, and dials out to CP over WSS. CP never opens
an inbound connection to your network and never sees raw rows.
See Architecture
and Concepts → Self-hosted vs hosted
for the full picture.
What self-hosting actually buys you:
- Routing metadata stays in your account. Tool name, snapshot id, latency, outcome — the audit log shape from Security § 7 — sits on a volume you control rather than ours.
- Single-tenant runtime. No co-tenancy with other customers' metadata.
- Frozen versions. You decide when to pull a new image.
What it does not buy you:
- Different exposure of customer rows. Both modes route through the same worker; the worker is what protects rows, not the CP.
- A different cryptographic boundary today. Mode A (the default) has the CP relaying SQL + masked results in process memory; that's true whether the CP is on Fly under our account or yours. Mode B (end-to-end encrypted, CP becomes payload-opaque) is on the roadmap and applies equally to both deploy modes — see Concepts → Mode A vs Mode B.
If those tradeoffs don't match a constraint you actually have, take hosted and keep moving.
Prerequisites for self-hosting#
- Container image. The airlockai/control-pane repo is private during the beta; we ship a signed image artifact as part of onboarding. Email hello@airlocklabs.ai to get registry access. There is no public DIY path today.
- A reachable HTTPS URL for AIRLOCK_CP_PUBLIC_URL. Workers dial it via WSS (wss://…/v1/tunnel); agents call <public_url>/mcp/<tenant_slug> over HTTPS. The CP refuses to start without it; see app.py::_require_public_url.
- TLS termination. Fly + Cloudflare DNS-only handle this for free; for bare VM use Caddy or nginx in front. Cloudflare's orange-cloud proxy will interfere with the WSS tunnel, so keep DNS records grey-cloud.
- Persistent volume, ~3 GB minimum. Holds config.yaml, state.yaml, audit.jsonl, the CP's Ed25519 keypair (cp_ed25519.pem), and the snapshot meta sidecars under snapshots/. Loss = re-onboard every worker.
- Egress allow-list. CP makes outbound calls only to operator OIDC providers (when wired) and your log shipping target. No outbound DB calls; that's the worker's job. CP has no DATABASE_URL field and rejects one in airlock.yaml if a worker ships it.
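The two endpoints named in the prerequisites both hang off AIRLOCK_CP_PUBLIC_URL, so it can help to derive them once and test reachability against the exact strings workers and agents will use. The sketch below is illustrative only: the helper names tunnel_url and mcp_url and the tenant slug "acme" are assumptions, not part of the Airlock codebase.

```python
from urllib.parse import urlsplit, urlunsplit


def tunnel_url(public_url: str) -> str:
    """Return the WSS tunnel URL a worker would dial for this CP."""
    parts = urlsplit(public_url)
    if parts.scheme != "https":
        raise ValueError("AIRLOCK_CP_PUBLIC_URL must be an https:// URL")
    # Same host, wss scheme, fixed /v1/tunnel path (per the prerequisites above).
    path = parts.path.rstrip("/") + "/v1/tunnel"
    return urlunsplit(("wss", parts.netloc, path, "", ""))


def mcp_url(public_url: str, tenant_slug: str) -> str:
    """Return the HTTPS MCP endpoint an agent would call for one tenant."""
    return f"{public_url.rstrip('/')}/mcp/{tenant_slug}"


# "acme" is a hypothetical tenant slug used only for illustration.
print(tunnel_url("https://cp.example.com"))
print(mcp_url("https://cp.example.com", "acme"))
```

Both URLs must be reachable through whatever TLS terminator you put in front of the CP, which is why a proxy that interferes with WebSocket upgrades breaks the first one even when the second works.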
Reference deploy: Fly.io#
Fly is what we use ourselves. One machine, one volume. Five commands from a clean checkout to a running CP:
flyctl launch --no-deploy --copy-config --name airlock-cp --region iad
flyctl volumes create airlock_state --size 3 --region iad --yes
flyctl secrets set AIRLOCK_CP_OPERATOR_TOKEN="$(openssl rand -hex 32)"
flyctl secrets set AIRLOCK_CP_PUBLIC_URL="https://cp.example.com"
flyctl deploy
Line by line:
- launch --no-deploy: creates the Fly app from the committed fly.toml but doesn't boot the machine yet. We need the volume + secrets in place first.
- volumes create airlock_state --size 3: backs the /state mount. 3 GB is plenty for the audit log; resize live later with flyctl volumes extend.
- AIRLOCK_CP_OPERATOR_TOKEN: the master credential for /v1/admin/*. Treat it like a root password. Rotate by setting a new value and redeploying; the old token stops working on the next request.
- AIRLOCK_CP_PUBLIC_URL: the HTTPS URL the CP advertises to workers. Without it CP refuses to start.
- flyctl deploy: first boot seeds config.yaml from config.example.yaml, generates cp_ed25519.pem, opens the tunnel listener.
The minimal fly.toml we ship looks like this:
app = "airlock-cp"
primary_region = "iad"
[build]
dockerfile = "Dockerfile"
[env]
AIRLOCK_CP_CONFIG = "/state/config.yaml"
AIRLOCK_CP_STATE_PATH = "/state/state.yaml"
AIRLOCK_CP_SNAPSHOT_DIR = "/state/snapshots"
AIRLOCK_CP_LOG_LEVEL = "INFO"
[mounts]
source = "airlock_state"
destination = "/state"
initial_size = "3gb"
[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = "off" # long-lived WS tunnels from workers
min_machines_running = 1
[[http_service.checks]]
interval = "30s"
method = "GET"
path = "/health"
timeout = "5s"
[[vm]]
memory = "512mb"
cpu_kind = "shared"
cpus = 1
Two things about this shape worth flagging:
- auto_stop_machines = "off". Workers hold a long-lived WSS tunnel; a stopped machine drops every connected worker.
- One replica. Runtime state lives on a single Fly volume rather than Postgres. Fine for pilot; multi-region requires migrating state to Postgres first. See Security § 12.
After deploy, point a custom domain at the Fly app:
flyctl certs create cp.example.com
flyctl certs show cp.example.com # follow the CNAME / TXT instructions
Set the Cloudflare CNAME for cp to airlock-cp.fly.dev with
DNS-only (grey cloud) — orange-cloud will break the WS handshake.
Verify:
curl https://cp.example.com/health # → ok
curl -H "Authorization: Bearer $AIRLOCK_CP_OPERATOR_TOKEN" \
https://cp.example.com/v1/admin/tenants # → {"tenants": []}
Other deploy targets#
The image ships a single Python/Starlette process listening on
:8080. Anywhere you can run a container with a persistent volume
will work; the shapes below are sketches, not certified runbooks.
Docker Compose#
Fine for a single-VM deploy or a homelab pilot. TLS terminates in front (Caddy / nginx).
services:
  cp:
    image: <your-registry>/airlockai/control-pane:<tag>
    restart: unless-stopped
    ports:
      - "8080:8080"
    environment:
      AIRLOCK_CP_OPERATOR_TOKEN: "${AIRLOCK_CP_OPERATOR_TOKEN}"
      AIRLOCK_CP_PUBLIC_URL: "https://cp.example.com"
      AIRLOCK_CP_CONFIG: "/state/config.yaml"
      AIRLOCK_CP_STATE_PATH: "/state/state.yaml"
      AIRLOCK_CP_SNAPSHOT_DIR: "/state/snapshots"
    volumes:
      - airlock_state:/state
volumes:
  airlock_state:
Kubernetes#
A Deployment (replicas: 1), a PVC for /state, a Service on 8080,
and an Ingress doing TLS termination is everything you need. The
manifest is straightforward — no operator, no CRDs. We don't ship a
Helm chart yet because we'd rather not maintain a half-baked one;
ping hello@airlocklabs.ai if you want a starter chart for your
cluster shape and we'll send what we have.
A few constraints worth knowing if you're writing the manifest yourself:
- replicas: 1 until state moves to Postgres. The volume is single-writer.
- Don't put a Cloudflare-style L7 proxy in front. Nginx / Traefik / Istio with WS support is fine. Cloudflare's orange-cloud is not.
- /health is GET 200 OK. Use it for both readiness and liveness.
Bare VM with systemd#
If you already have a hardened VM with TLS in front (Caddy, nginx,
or an ALB), the smallest deploy is a single docker run behind a
systemd unit:
docker run -d --name airlock-cp \
-p 127.0.0.1:8080:8080 \
-e AIRLOCK_CP_OPERATOR_TOKEN="$(cat /etc/airlock/op_token)" \
-e AIRLOCK_CP_PUBLIC_URL="https://cp.example.com" \
-v /var/lib/airlock/state:/state \
--restart unless-stopped \
<your-registry>/airlockai/control-pane:<tag>
Caddy in front of it gets you free TLS:
cp.example.com {
reverse_proxy 127.0.0.1:8080
}
AWS ECS / GCP Cloud Run#
Both work as long as the runtime supports a persistent volume.
Cloud Run gen2 with a Cloud Storage FUSE mount or a regional Filestore
share is the minimum for /state; ECS with EFS works fine. The
ephemeral-disk default won't — state.yaml and the audit log need
to survive a restart.
Required configuration#
These are the env vars CP actually reads. Anything not on this list is either a worker var or doesn't exist; don't invent them.
| Env var | Required | What it does |
|---|---|---|
| AIRLOCK_CP_OPERATOR_TOKEN | yes | Bearer for /v1/admin/* (god-mode). Console "operator-token" login also resolves to this until OIDC ships. Constant-time compared with hmac.compare_digest. |
| AIRLOCK_CP_PUBLIC_URL | yes | Public URL CP advertises to workers (https://cp.example.com). CP refuses to start without it. Workers dial wss://…/v1/tunnel here; agents POST <url>/mcp/<tenant_slug>. |
| AIRLOCK_SESSION_SECRET | recommended | HS256 secret for verifying console-issued session JWTs (Auth.js v5). 32-byte random, shared with the console build. Optional today; without it, only operator-token auth works. |
| AIRLOCK_CP_CONFIG | no, default /state/config.yaml | Path to config.yaml. The image's entrypoint seeds it from config.example.yaml on first boot. |
| AIRLOCK_CP_STATE_PATH | no, default <audit-dir>/state.yaml | Path to runtime state (tenants, API-key hashes, worker pubkeys). |
| AIRLOCK_CP_SNAPSHOT_DIR | no, default /state/snapshots | Directory for the per-tenant snapshot meta sidecars. |
| AIRLOCK_CP_LOG_LEVEL | no, default INFO | Standard Python log level. |
| AIRLOCK_CP_ALLOWED_ORIGINS | no | CORS allow-list (comma-separated). Production: your console origin. |
| AIRLOCK_CP_METRICS_PORT | no, default 9090 | Prometheus listener; set 0 to disable. |
| AIRLOCK_CP_MAX_SESSIONS_PER_TENANT | no, default 8 | Hard cap on connected workers per tenant. A misconfigured replicas: 10000 won't OOM CP. |
| AIRLOCK_DEMO_TENANT_REGISTER | no, default true in the image | Registers the three bundled hosted-demo tenants (t_fintech, t_healthcare, t_hr) on startup so the playground is browseable out of the box. Set false for a clean self-hosted deploy where the only tenants are the ones you create. Legacy alias: AIRLOCK_DEMO_TENANT_ENABLED=false. |
| AIRLOCK_CP_ALLOW_LOCALHOST_PUBLIC_URL | no, default false | Local dev only. Lets AIRLOCK_CP_PUBLIC_URL be unset / localhost. Never set in production: workers will receive ws://localhost:18080 enrollment payloads. |
The cp_private_key_path (default /state/cp_ed25519.pem) and
audit_log_path (default /state/audit.jsonl) are read from
config.yaml's server: block, not env. The defaults match the
fly.toml mount; you usually don't need to touch them.
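Before a production deploy it can be useful to run a small preflight over the environment you are about to hand the container. The following is a minimal sketch under stated assumptions: CPEnv and load_cp_env are hypothetical names, the defaults are copied from the table above, and the checks only approximate what the CP enforces at startup (the real check lives in app.py::_require_public_url and may differ). It deliberately ignores AIRLOCK_CP_ALLOW_LOCALHOST_PUBLIC_URL because it targets production values.

```python
from dataclasses import dataclass
from typing import Mapping
from urllib.parse import urlsplit

REQUIRED = ("AIRLOCK_CP_OPERATOR_TOKEN", "AIRLOCK_CP_PUBLIC_URL")
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


@dataclass(frozen=True)
class CPEnv:
    operator_token: str
    public_url: str
    config_path: str
    snapshot_dir: str
    log_level: str
    metrics_port: int
    max_sessions_per_tenant: int


def load_cp_env(env: Mapping[str, str]) -> CPEnv:
    """Validate required vars and apply the documented defaults (illustrative)."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required env vars: " + ", ".join(missing))

    url = urlsplit(env["AIRLOCK_CP_PUBLIC_URL"])
    if url.scheme != "https":
        raise ValueError("AIRLOCK_CP_PUBLIC_URL must be an https:// URL")
    if url.hostname in LOCAL_HOSTS:
        raise ValueError("AIRLOCK_CP_PUBLIC_URL must not point at localhost")

    return CPEnv(
        operator_token=env["AIRLOCK_CP_OPERATOR_TOKEN"],
        public_url=env["AIRLOCK_CP_PUBLIC_URL"],
        config_path=env.get("AIRLOCK_CP_CONFIG", "/state/config.yaml"),
        snapshot_dir=env.get("AIRLOCK_CP_SNAPSHOT_DIR", "/state/snapshots"),
        log_level=env.get("AIRLOCK_CP_LOG_LEVEL", "INFO"),
        metrics_port=int(env.get("AIRLOCK_CP_METRICS_PORT", "9090")),
        max_sessions_per_tenant=int(env.get("AIRLOCK_CP_MAX_SESSIONS_PER_TENANT", "8")),
    )
```

Calling load_cp_env(os.environ) in CI before the deploy step turns a CP that refuses to boot into a failed pipeline stage, which is usually easier to notice.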
Operations#
Backups#
Snapshot the /state volume nightly. The state dir contains the
audit log and tenant registry; loss means re-onboarding every
worker. Fly volumes auto-snapshot daily by default; keep the default
snapshot retention for a pilot, and raise it to 30+ days for real enterprise retention.
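On targets without built-in volume snapshots (Compose, a bare VM), a file-level archive of the state directory is one way to get a nightly backup. This is a rough sketch, not a supported tool: backup_state and the archive naming are assumptions, and because audit.jsonl is appended to while the CP runs, a platform-level volume snapshot remains the safer option when one is available.

```python
import tarfile
import time
from pathlib import Path


def backup_state(state_dir: str, dest_dir: str) -> Path:
    """Write a timestamped .tar.gz of state_dir into dest_dir and return its path.

    dest_dir should live outside state_dir so archives are not nested
    inside later backups.
    """
    out_dir = Path(dest_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    archive = out_dir / f"airlock-state-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # Store everything under a top-level "state/" prefix in the archive.
        tar.add(str(state_dir), arcname="state")
    return archive
```

A cron entry or systemd timer that calls backup_state("/var/lib/airlock/state", "/var/backups/airlock") once a night, followed by an upload to object storage, would cover the nightly requirement; both paths here are examples.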
Upgrades#
Deploy a newer image tag. On Fly, flyctl deploy does a rolling
single-machine restart; tunnel sessions auto-reconnect within ~5
seconds. State is forward-compatible within a major version. We
post breaking changes in release notes; until 1.0 we'll email
self-hosting customers individually.
Rollback#
flyctl releases list -a airlock-cp
flyctl releases revert <version>
State migrations are forward-only on major-version bumps; rolling back a major won't work without restoring a volume snapshot. Within a minor, revert is safe.
Log shipping#
CP writes structured JSON to stdout. Pipe it to whatever log
aggregator you already run — Datadog, Splunk, CloudWatch, an OTel
collector. The audit log is separate (/state/audit.jsonl,
append-only) and is what feeds the console's /audit page over
SSE; ship it to your SIEM via the volume, not via stdout.
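Shipping the audit log from the volume usually means tailing an append-only JSON Lines file and remembering how far you got. The sketch below shows one way to do that; read_new_audit_records is a hypothetical helper, and it assumes only that audit.jsonl holds one JSON object per line. It makes no assumptions about field names.

```python
import json
from typing import Any


def read_new_audit_records(path: str, offset: int = 0) -> tuple[list[dict[str, Any]], int]:
    """Return complete JSON records appended after byte `offset`, plus the new offset."""
    records: list[dict[str, Any]] = []
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                # EOF, or a last line that is still being written: stop here
                # and pick it up on the next call.
                break
            offset += len(line)
            if line.strip():
                records.append(json.loads(line))
    return records, offset
```

Persist the returned offset between runs (a small file next to the shipper is enough) and forward each batch to your SIEM; because the file is append-only, replaying from a saved offset does not skip or duplicate records.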
Monitoring#
/health returns 200 OK when the process is up; point your uptime
monitor at it. /v1/tenants is a richer, unauthenticated status endpoint
that returns each tenant's connected-session count, which makes it useful as a
"workers are still dialed in" probe.
The :9090/metrics Prometheus endpoint exposes
airlock_cp_tool_calls_total{tool,outcome},
airlock_cp_tool_latency_seconds{tool},
airlock_cp_active_sessions{tenant_id}, and
airlock_cp_hello_total. Set AIRLOCK_CP_METRICS_PORT=0 to disable.
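If you don't run Prometheus yet, the active-sessions gauge can still drive a simple alert by parsing the text exposition output directly. This sketch assumes the gauge carries tenant_id as its only label, as listed above; the function names are hypothetical. A tenant that has never connected may not appear in the output at all, so treat an absent tenant as a separate case.

```python
import re

# Matches lines such as: airlock_cp_active_sessions{tenant_id="t_hr"} 0
_SAMPLE = re.compile(r'^airlock_cp_active_sessions\{tenant_id="([^"]*)"\}\s+(\S+)')


def active_sessions(metrics_text: str) -> dict[str, float]:
    """Map tenant_id -> connected-session count from a /metrics response body."""
    sessions: dict[str, float] = {}
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if match:
            sessions[match.group(1)] = float(match.group(2))
    return sessions


def tenants_without_workers(metrics_text: str) -> list[str]:
    """Tenants that are reported with zero connected workers."""
    return sorted(t for t, n in active_sessions(metrics_text).items() if n == 0)
```

Feed it the body of a GET against :9090/metrics and page when tenants_without_workers returns a tenant you expect to be online.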
Operator token rotation#
The operator token is the master credential. Rotate every ~90 days and on operator turnover.
NEW_TOKEN="$(openssl rand -hex 32)"
flyctl secrets set AIRLOCK_CP_OPERATOR_TOKEN="$NEW_TOKEN" # restarts the machine
Then update any place you pasted the old token: the console login
form, scripted curl sessions, Terraform cp_operator_token
variables, CI secrets. The previous token stops working on the
next request after the machine restart.
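For rotation scripts that live outside a shell, the same token shape is available from the Python standard library, and the constant-time comparison mentioned in the configuration table is a one-liner. This is a sketch only: new_operator_token and bearer_matches are hypothetical helpers and are not the CP's actual auth code.

```python
import hmac
import secrets


def new_operator_token() -> str:
    """32 random bytes, hex-encoded: the same shape as `openssl rand -hex 32`."""
    return secrets.token_hex(32)


def bearer_matches(authorization_header: str, expected_token: str) -> bool:
    """Constant-time check of an `Authorization: Bearer <token>` header value."""
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

Generating the token in the same script that writes it to your secret store avoids it ever landing in shell history.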
OIDC sign-in (Google / Okta) is on the roadmap and will replace the shared-secret model; the operator token will survive only as a break-glass credential. See Security § 6 — Operator-side for the full auth model.
Security notes#
- The operator token is root-equivalent today. It bypasses every org-scope check on /v1/admin/*. Console session cookies derive from it. Treat it like a database password: restrict to a tight list of operators, rotate on turnover, never paste into chat.
- Volume contents are sensitive. audit.jsonl is append-only but readable; tenant Ed25519 public keys and API-key Argon2id hashes live in state.yaml. If your platform doesn't encrypt volumes at rest, do it at the filesystem layer.
- CP is stateless re: customer rows. It sees routing metadata and the routed envelope, never SQL or row content stored to disk. The worker is the only thing with database credentials. A CP compromise can't exfiltrate snapshots because it doesn't have them. See Security § 4 for the full data inventory.
Troubleshooting#
- CP refuses to start, error mentions AIRLOCK_CP_PUBLIC_URL. Either unset or contains localhost / 127.0.0.1. Set it to a real HTTPS URL or, for local dev only, set AIRLOCK_CP_ALLOW_LOCALHOST_PUBLIC_URL=true.
- Worker can't connect. From the worker host: curl -v $AIRLOCK_CP_PUBLIC_URL/health. If that fails, the worker can't reach CP; check egress, DNS, and TLS chain. The WSS upgrade rides the same connection; HTTPS reachability is a strict prerequisite.
- Console login fails with the operator token. The token in flyctl secrets list doesn't match what you're typing. Rotate with flyctl secrets set and try again with the new value.
- HELLO_NACK reason=tenant_session_cap_reached. A worker Deployment scaled past AIRLOCK_CP_MAX_SESSIONS_PER_TENANT (default 8). Either bump the cap or scale workers down.
- /state out of disk. audit.jsonl grows monotonically (no rotation today; see Security § 12). On Fly: flyctl volumes extend airlock_state --size 10. Per-tenant retention + S3 archival is in-flight.
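Since the audit log has no rotation today, a small disk check on the state mount can warn you before the last item above becomes an outage. A minimal sketch: the function names and the 80% threshold are assumptions to adjust for your own volume size and log growth rate.

```python
import shutil


def state_usage_fraction(path: str = "/state") -> float:
    """Fraction of the filesystem holding `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_extend(path: str = "/state", threshold: float = 0.8) -> bool:
    """True once usage crosses the threshold (0.8 is an arbitrary example)."""
    return state_usage_fraction(path) >= threshold
```

Run it from wherever the volume is mounted and alert when needs_extend returns True; on Fly the follow-up is the flyctl volumes extend command shown above.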
Next steps#
- Quickstart → Step 2 — Log into the console picks up where this page leaves off.
- Architecture for the mental model the rest of the docs assume.
- Security for the threat model, cryptography, and data-handling table.