Architecture

Airlock sits between your agent and your production database. Instead of giving the agent SQL access to prod, you give it SQL access to a per-snapshot, ephemeral, PII-masked DuckDB that lives inside your VPC.

One-paragraph version: Your worker runs in your VPC, holds the only copy of DATABASE_URL, and exports ephemeral DuckDB snapshots (one per row id from your root_table) with masking applied at export time. The control plane (cp.airlocklabs.ai) relays MCP tool-call envelopes between agents and your worker but never holds your DB credentials, never writes SQL or rows to disk, and persists only routing metadata. The console (console.airlocklabs.ai) is the operator UI on top of the control plane. For per-term definitions of worker, tenant, snapshot, mask policy, egress block, and Mode A vs B, see the Concepts glossary. Internally, the row id a snapshot is keyed by is called the subject; see Concepts.

This page is the mental model in plain English. We use sandbox and snapshot interchangeably in this doc — sandbox is the customer-facing concept (an isolated, ephemeral SQL surface), snapshot is the file on disk (<snapshot_id>.duckdb in tmpfs).

Roles

Three named roles show up throughout this page. We use these terms consistently — generic "user" is too ambiguous when three different humans touch the system.

  • end-user — the person whose data is in the snapshot (e.g. Alice, the end customer of the company that installed Airlock). Internally the worker also calls this person the subject — the WHO whose row a snapshot was keyed by.
  • operator — the engineer at the customer who installs Airlock, edits airlock.yaml, and runs the worker.
  • caller — the human chatting with the LLM. The caller may or may not be the end-user; a support agent asking about Alice's account is the caller, Alice is the end-user.

The pieces

There are three processes in the system. One runs in the operator's VPC, one is hosted by Airlock, and one runs in the caller's client.

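| Process       | Runs in                                  | What it owns                                   |
|---------------|------------------------------------------|------------------------------------------------|
| Worker        | Operator's VPC                           | DB credentials, raw row data, DuckDB snapshots |
| Control plane | Airlock-hosted                           | Routing, audit metadata, API keys              |
| Agent         | Caller's client (LiteLLM, Cursor, etc.)  | Bearer API key                                 |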

The wiring at a glance:

   Operator's VPC                  Airlock-hosted              Caller's client
 ┌────────────────────┐         ┌──────────────────┐        ┌────────────────┐
 │                    │         │                  │        │                │
 │  ┌──────────────┐  │  WSS    │ ┌──────────────┐ │ HTTPS  │  ┌──────────┐  │
 │  │   Worker     │◄─┼─────────┼─┤ Control      │◄┼────────┼──┤  Agent   │  │
 │  │  (tmpfs:     │  │ (out)   │ │ plane        │ │ MCP    │  │ (LLM +   │  │
 │  │  *.duckdb)   │  │         │ │ (routing,    │ │ JSON-  │  │  tools)  │  │
 │  └──────┬───────┘  │         │ │  audit)      │ │ RPC    │  └──────────┘  │
 │         │ SELECT   │         │ └──────────────┘ │        │                │
 │  ┌──────▼───────┐  │         │                  │        └────────────────┘
 │  │  Source DB   │  │         └──────────────────┘
 │  │  (Postgres)  │  │
 │  └──────────────┘  │
 └────────────────────┘

The agent talks MCP JSON-RPC to the control plane. The control plane multiplexes that call over an outbound WebSocket to a worker inside the operator's VPC. The worker runs the SQL against a DuckDB file in tmpfs and returns the result. The production database is never queried by the agent.
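
To make the first hop concrete, here is a minimal sketch of the kind of JSON-RPC 2.0 message an agent sends for an MCP tool call. The `tools/call` method and the `name`/`arguments` parameters come from the MCP specification; the tool name `run_sql` and the argument names `snapshot_id` and `sql` are assumptions made for illustration, not Airlock's documented tool surface.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    envelope = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(envelope)


# Hypothetical tool name and argument names, for illustration only.
message = build_tool_call(
    1,
    "run_sql",
    {"snapshot_id": "alice", "sql": "SELECT count(*) FROM orders"},
)
print(message)
```

The control plane would forward an envelope like this to a worker over the tunnel; the SQL inside it runs against the DuckDB snapshot, not the source database.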

Why the worker dials out

The control plane never reaches into the operator's VPC. The worker dials out over WSS (port 443) and the control plane multiplexes inbound traffic back over the same socket. This is what lets the operator run Airlock without opening any inbound firewall holes.

The worker authenticates to the control plane with an Ed25519 keypair the worker generates locally. Only the public half ever leaves the host; the private key never transits the control plane. We issue single-use enrollment tokens to bootstrap that keypair without anyone having to ship a private key over the wire.

Snapshots are ephemeral

When the agent issues its first MCP tool call for snapshot_id=alice, the worker checks data_dir for alice.duckdb. If the file is missing or older than the configured TTL (snapshot_ttl_s), the worker exports a fresh snapshot from the source DB, applies the masking rules from airlock.yaml, and writes the file to data_dir. The TTL's code default is 300s; the hosted demo configs override it to 24h to keep cached snapshots warm across deploys. data_dir defaults to /dev/shm/airlock on Linux and /tmp/airlock-snapshots in the Fly demo image, and is configurable on non-Linux hosts via AIRLOCK_DATA_DIR. Subsequent calls within the TTL window read the warm file in 30–150ms.

A background reaper deletes any snapshot whose age exceeds the TTL. Worker restart wipes everything in tmpfs — the next call re-exports.

This buys three things:

  1. Bounded data lifetime per snapshot. A copy of any one row's data exists only for the duration of an active chat plus the TTL. There is no long-lived "agent's view of the data" sitting on disk.
  2. No persistent disk to manage. Operators size one number (container memory). No PVC, no S3 backend, no stale-cache eviction policy.
  3. Multi-worker is free. Workers share no state, so scaling is just a matter of raising replicas: in the k8s manifest.

What it costs: the first call of every chat session pays the export latency (~1–2s in the demo, can be longer against larger source schemas). Customers we've talked to accept this in exchange for snapshots that never persist past the TTL window.
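
The freshness check and the background reaper described in this section can be sketched as follows. This is an illustration under stated assumptions, not the worker's actual code: the function names are hypothetical, file age is approximated by mtime, and a temporary directory stands in for data_dir. Only the `<snapshot_id>.duckdb` naming and the 300s default come from the text above.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List

SNAPSHOT_TTL_S = 300  # mirrors the documented code default for snapshot_ttl_s


def snapshot_path(data_dir: Path, snapshot_id: str) -> Path:
    return data_dir / f"{snapshot_id}.duckdb"


def needs_export(path: Path, ttl_s: float, now: float) -> bool:
    """True when the snapshot file is missing or older than the TTL."""
    try:
        age = now - path.stat().st_mtime
    except FileNotFoundError:
        return True
    return age > ttl_s


def reap_expired(data_dir: Path, ttl_s: float, now: float) -> List[str]:
    """Delete every *.duckdb file older than the TTL; return removed names."""
    removed = []
    for path in sorted(data_dir.glob("*.duckdb")):
        if needs_export(path, ttl_s, now):
            path.unlink(missing_ok=True)
            removed.append(path.name)
    return removed


# Demo against a throwaway directory standing in for data_dir.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    now = time.time()
    fresh = snapshot_path(data_dir, "alice")
    stale = snapshot_path(data_dir, "bob")
    fresh.write_bytes(b"")
    stale.write_bytes(b"")
    os.utime(stale, (now - 600, now - 600))  # backdate by 10 minutes

    missing_needs = needs_export(snapshot_path(data_dir, "carol"), SNAPSHOT_TTL_S, now)
    fresh_needs = needs_export(fresh, SNAPSHOT_TTL_S, now)
    removed = reap_expired(data_dir, SNAPSHOT_TTL_S, now)
    remaining = sorted(p.name for p in data_dir.glob("*.duckdb"))
```

In this sketch a missing file and an expired file take the same path (re-export), which matches the idea that re-export on miss is the only recovery mechanism.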

Multi-worker per tenant

The control plane allows up to 8 tunnels per tenant by default (AIRLOCK_CP_MAX_SESSIONS_PER_TENANT). Tool calls round-robin across the connected sessions. If a worker drops, in-flight calls return -32001 worker_unavailable (retryable) and the next call lands on a survivor — at the cost of one cold-export hit for affected snapshots.

Because workers share no state, there's no consistent-hash routing on snapshot_id. Once the TTL lapses (a few minutes by default), Worker-1 holds no warmer state for Alice than Worker-2. Round-robin is simpler and equally efficient.
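
A minimal sketch of the routing behaviour described in this section: a per-tenant session cap, round-robin dispatch, and a retryable error for calls that were in flight on a dropped worker. The class and method names are hypothetical, and the real control plane may track sessions differently; only the cap of 8 and the -32001 worker_unavailable code come from the text.

```python
from typing import Dict, List, Optional

WORKER_UNAVAILABLE = -32001  # retryable error code named in the text


class TenantSessions:
    """Round-robin over one tenant's connected workers (illustrative only)."""

    def __init__(self, max_sessions: int = 8) -> None:
        self.max_sessions = max_sessions
        self.workers: List[str] = []
        self.in_flight: Dict[str, str] = {}  # call id -> worker id
        self._cursor = 0

    def connect(self, worker_id: str) -> bool:
        """Admit a tunnel unless the tenant is already at its cap."""
        if len(self.workers) >= self.max_sessions:
            return False
        self.workers.append(worker_id)
        return True

    def dispatch(self, call_id: str) -> Optional[str]:
        """Pick the next worker in rotation; None when no worker is connected."""
        if not self.workers:
            return None
        worker = self.workers[self._cursor % len(self.workers)]
        self._cursor += 1
        self.in_flight[call_id] = worker
        return worker

    def drop(self, worker_id: str) -> List[dict]:
        """Remove a worker and fail its in-flight calls with a retryable error."""
        if worker_id in self.workers:
            self.workers.remove(worker_id)
        failed = [cid for cid, w in self.in_flight.items() if w == worker_id]
        errors = []
        for cid in failed:
            del self.in_flight[cid]
            errors.append({
                "jsonrpc": "2.0",
                "id": cid,
                "error": {"code": WORKER_UNAVAILABLE, "message": "worker_unavailable"},
            })
        return errors


tenant = TenantSessions()
tenant.connect("worker-1")
tenant.connect("worker-2")
assigned = [tenant.dispatch("c1"), tenant.dispatch("c2"), tenant.dispatch("c3")]
errors = tenant.drop("worker-2")   # c2 was in flight on worker-2
retry = tenant.dispatch("c2")      # the retry lands on the survivor
```

Note that nothing here consults snapshot_id when choosing a worker, which is the point of the paragraph above: the retry on the surviving worker simply pays a cold export if that worker has no warm file.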

What crosses each boundary

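| Boundary      | Inside                                                           | Outside                   | What crosses                                                                                                                   |
|---------------|------------------------------------------------------------------|---------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| Customer VPC  | Source DB, worker, snapshots, DB credentials, worker private key | Everything else           | Outbound WSS to CP: tool envelopes + heartbeats. TLS to source DB: SELECT queries.                                             |
| Control plane | Argon2id-hashed API keys, Ed25519 keys, audit metadata           | Agent-side, customer-side | HTTPS to agents (MCP JSON-RPC). WSS to workers (REQUEST/RESPONSE for tool calls; ADMIN_REQUEST for operator config writes).    |
| Agent         | Bearer API key                                                   | Customer data             | MCP only: no DB access, no filesystem access.                                                                                  |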

The control plane sees routing metadata (tool name, snapshot id, latency, outcome) and the JSON-RPC envelope it routes to the worker. In Mode A (the deployed default) the envelope payload is plaintext: CP holds SQL text and result rows in process memory during a request and discards them on response. Nothing payload-related is written to disk or to the audit log. Mode B (Concepts → Mode A vs Mode B) makes CP a payload-opaque relay; the envelope shape has shipped, but key exchange and the client shim are still pending.

Two channels on one tunnel

The same WebSocket carries two distinct kinds of traffic:

  • REQUEST / RESPONSE frames — LLM tool calls. Each REQUEST carries a CP-signed forwarded JWT bound to a tenant + API key + snapshot + tool. The worker re-validates every claim.
  • ADMIN_REQUEST / ADMIN_RESPONSE frames — operator-only config reads/writes from the console. Each carries a different JWT shape with an admin: true claim and an op (config_get / config_put) bound to the frame. The worker's tool dispatcher refuses to run admin ops, and the admin verifier explicitly rejects tool tokens. An LLM API key cannot reach config_put under any failure mode.
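
The separation between the two channels can be sketched as two verifiers that each reject the other's token shape. This is an illustration only: it assumes the JWT signature has already been verified and works on the decoded claims, and the claim and frame field names (`tenant`, `api_key`, `snapshot`, `tool`, `op`, `type`) are hypothetical stand-ins for whatever the real tokens carry.

```python
from typing import Any, Dict

ADMIN_OPS = {"config_get", "config_put"}
TOOL_BINDING = ("tenant", "api_key", "snapshot", "tool")  # assumed claim names


class Rejected(Exception):
    """Raised when a frame's token does not match the channel or the frame."""


def verify_tool_claims(claims: Dict[str, Any], frame: Dict[str, Any]) -> None:
    # Claims are assumed to be signature-verified already; check binding only.
    if claims.get("admin"):
        raise Rejected("admin token on tool channel")
    for key in TOOL_BINDING:
        if key not in claims or claims[key] != frame.get(key):
            raise Rejected(f"claim mismatch: {key}")
    if claims["tool"] in ADMIN_OPS:
        raise Rejected("admin op is not a tool")


def verify_admin_claims(claims: Dict[str, Any], frame: Dict[str, Any]) -> None:
    if claims.get("admin") is not True:
        raise Rejected("tool token on admin channel")
    if claims.get("op") not in ADMIN_OPS or claims.get("op") != frame.get("op"):
        raise Rejected("op not bound to frame")


def handle_frame(frame: Dict[str, Any], claims: Dict[str, Any]) -> str:
    """Route a frame to its channel's verifier and name the reply frame type."""
    kind = frame.get("type")
    if kind == "REQUEST":
        verify_tool_claims(claims, frame)
        return "RESPONSE"
    if kind == "ADMIN_REQUEST":
        verify_admin_claims(claims, frame)
        return "ADMIN_RESPONSE"
    raise Rejected(f"unknown frame type: {kind}")


def outcome(frame: Dict[str, Any], claims: Dict[str, Any]) -> str:
    try:
        return handle_frame(frame, claims)
    except Rejected as exc:
        return f"rejected: {exc}"


tool_claims = {"tenant": "t1", "api_key": "k1", "snapshot": "alice", "tool": "run_sql"}
tool_frame = {"type": "REQUEST", **tool_claims}
admin_claims = {"admin": True, "tenant": "t1", "op": "config_put"}
admin_frame = {"type": "ADMIN_REQUEST", "op": "config_put"}

print(outcome(tool_frame, tool_claims))
print(outcome(admin_frame, tool_claims))
```

Checking from both sides means a bug in one verifier alone is not enough for a tool token to reach a config write.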

Snapshots in tmpfs — a security feature

The fact that snapshots live in RAM and not on disk is load-bearing for the security story:

  • Process exit (crash, eviction, decommission) wipes every snapshot atomically. The OS reclaims tmpfs pages; there's no .duckdb left on a stopped pod's volume for a forensic image to recover.
  • A worker restart is an automatic full snapshot rotation.
  • The "snapshots never persist past the chat" claim in the security whitepaper is enforced by the OS, not just policy.

Where the audit lives

Every tool call writes one JSONL record to the control plane's audit log: tenant, API key, worker, tool, snapshot id, outcome, latency, trace id. SQL text and row data are never written. The audit log is the system of record for "who asked what about whom, and what happened."
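
One way to guarantee that SQL text and rows never reach the audit log is to build each record from an allow-list of fields rather than filtering out known-sensitive ones. The sketch below illustrates that idea; the exact field names (for example `latency_ms`, `trace_id`) and the event shape are assumptions, and the real record layout may differ.

```python
import io
import json
from typing import TextIO

# Allow-list mirroring the fields named above; key names are assumptions.
AUDIT_FIELDS = (
    "tenant", "api_key", "worker", "tool",
    "snapshot_id", "outcome", "latency_ms", "trace_id",
)


def write_audit_record(log: TextIO, event: dict) -> None:
    """Append one JSONL line holding only allow-listed metadata fields."""
    record = {field: event.get(field) for field in AUDIT_FIELDS}
    log.write(json.dumps(record) + "\n")


# Hypothetical in-memory event; "sql" and "rows" must not reach the log.
event = {
    "tenant": "acme",
    "api_key": "key_123",
    "worker": "worker-1",
    "tool": "run_sql",
    "snapshot_id": "alice",
    "outcome": "ok",
    "latency_ms": 42,
    "trace_id": "trace-abc",
    "sql": "SELECT * FROM orders",
    "rows": [[1, "masked"]],
}

log = io.StringIO()
write_audit_record(log, event)
line = log.getvalue()
print(line, end="")
```

With an allow-list, a new payload field added to the event later is excluded by default instead of leaking until someone remembers to redact it.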

What's not in the picture

  • No LLM gateway. Bring your own LiteLLM (or Cursor, or any MCP-speaking client). We're a tool server, not model infrastructure. See the security whitepaper for why we don't put ourselves in the prompt path.
  • No persistent snapshot store. No PVC, no S3 backend. Re-export on miss is the recovery path; the worker never depends on snapshot durability.
  • No cross-region replication of the control plane. Single-region for now; documented as a roadmap item.