Isolation
JACO enforces two isolation boundaries cluster-wide:
- Cross-deployment: containers in different deployments cannot reach each other. Period.
- Cross-network within a deployment: containers in the same deployment whose compose networks: entries are disjoint cannot reach each other.
Both are enforced by L2 bridge separation (one bridge per
(deployment, network); see Networking) and L3
nftables rules in the inet jaco table. Both layers must be
operational on every node before that node is considered ready.
Code: internal/discovery/firewall/.
The no-host-disruption invariant
JACO's ruleset is all-accept by policy. Every base chain is
policy accept; JACO never blanket-drops host ingress or forwarded
traffic it does not own. The operator's other docker networks, host
routing, SSH (on whatever port), a VPN, a VNet, their own host
firewall — all of that is the operator's domain, and a policy-drop
chain would silently trespass on choices JACO cannot know about.
The only packets JACO drops are those flowing between two of its
own container subnets that belong to different (deployment, network)
scopes. This is scoped to a jaco_pool set (the union of every JACO
subnet), so anything outside JACO's own address space is never matched.
Source of truth: render.go
(Render is pure-Go and golden-file tested).
The ruleset
JACO manages a single nftables table, inet jaco. With one workload
deployment front on the default network, the rendered ruleset looks
like:
add table inet jaco
delete table inet jaco
table inet jaco {
    set dep_net_front__default {
        type ipv4_addr
        flags interval
        elements = { 10.244.1.0/24 }
    }
    set jaco_pool {
        type ipv4_addr
        flags interval
        elements = { 10.244.1.0/24 }
    }
    chain forward {
        type filter hook forward priority 0; policy accept;
        ct state established,related accept
        ip saddr @dep_net_front__default ip daddr @dep_net_front__default accept
        ip saddr @jaco_pool ip daddr @jaco_pool drop
    }
    chain input {
        type filter hook input priority 0; policy accept;
    }
    chain output {
        type filter hook output priority 0; policy accept;
    }
}
The leading add table inet jaco / delete table inet jaco pair is the
atomic-replace prelude (see Atomic reload): add
creates the table if it's absent so the following delete can't fail on
a cold host, and delete drops the entire prior generation so a
re-apply rebuilds the table from scratch instead of appending to it.
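To make the shape of the rendered file concrete, here is a minimal Go sketch of a renderer that emits the prelude, the sets, and the three chains. renderRuleset and its map argument are hypothetical names for illustration; the real Render in render.go may order and format its output differently. The sketch assumes every scope has at least one subnet and that subnets do not overlap across scopes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderRuleset is a hypothetical, simplified stand-in for Render.
// scopes maps a set name to the subnets (CIDR strings) of that scope.
func renderRuleset(scopes map[string][]string) string {
	names := make([]string, 0, len(scopes))
	var pool []string
	for name, subnets := range scopes {
		names = append(names, name)
		pool = append(pool, subnets...)
	}
	sort.Strings(names) // deterministic output so two renders can be compared
	sort.Strings(pool)

	var b strings.Builder
	// Atomic-replace prelude: add makes the delete safe on a cold host,
	// delete discards the previous generation of the table.
	b.WriteString("add table inet jaco\ndelete table inet jaco\n")
	b.WriteString("table inet jaco {\n")
	writeSet := func(name string, elems []string) {
		fmt.Fprintf(&b, "\tset %s {\n\t\ttype ipv4_addr\n\t\tflags interval\n\t\telements = { %s }\n\t}\n",
			name, strings.Join(elems, ", "))
	}
	for _, n := range names {
		writeSet(n, scopes[n])
	}
	if len(pool) > 0 { // jaco_pool only exists when there is at least one subnet
		writeSet("jaco_pool", pool)
	}
	b.WriteString("\tchain forward {\n\t\ttype filter hook forward priority 0; policy accept;\n")
	b.WriteString("\t\tct state established,related accept\n")
	for _, n := range names {
		fmt.Fprintf(&b, "\t\tip saddr @%s ip daddr @%s accept\n", n, n)
	}
	if len(pool) > 0 {
		b.WriteString("\t\tip saddr @jaco_pool ip daddr @jaco_pool drop\n")
	}
	b.WriteString("\t}\n")
	for _, hook := range []string{"input", "output"} {
		fmt.Fprintf(&b, "\tchain %s {\n\t\ttype filter hook %s priority 0; policy accept;\n\t}\n", hook, hook)
	}
	b.WriteString("}\n")
	return b.String()
}

func main() {
	fmt.Print(renderRuleset(map[string][]string{
		"dep_net_front__default": {"10.244.1.0/24"},
	}))
}
```

Because the function is pure (no I/O, sorted iteration), its output can be compared byte-for-byte against a stored file, which is the property golden-file testing relies on.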
Named sets
- One set dep_net_<dep>_<net> per (deployment, network), holding every host's /24 for that scope (per-host /24s, issue #28) so cross-host same-scope traffic matches @set on both saddr and daddr. Names are sanitized to [a-zA-Z0-9_] and hashed when they would exceed nftables' 63-char identifier limit (SetName).
- set jaco_pool: the union of every JACO subnet. Emitted only when at least one subnet exists. It scopes the cross-network drop.
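The naming rule can be illustrated with a short Go sketch. sketchSetName is a hypothetical stand-in, not the real SetName: the double-underscore separator is taken from the rendered example above (dep_net_front__default), while the truncate-and-append-SHA-256 scheme for over-long names is purely an assumption, since the text does not say how the hash is formed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
)

const maxSetNameLen = 63 // identifier limit cited in the text

var unsafeChars = regexp.MustCompile(`[^a-zA-Z0-9_]`)

// sketchSetName builds a set name for one (deployment, network) scope.
// Hypothetical: the real SetName may sanitize and hash differently.
func sketchSetName(dep, network string) string {
	name := "dep_net_" +
		unsafeChars.ReplaceAllString(dep, "_") + "__" +
		unsafeChars.ReplaceAllString(network, "_")
	if len(name) <= maxSetNameLen {
		return name
	}
	// Assumed scheme: keep a readable prefix and append a short hash of the
	// original pair so that distinct long names remain distinct.
	sum := sha256.Sum256([]byte(dep + "\x00" + network))
	suffix := "_" + hex.EncodeToString(sum[:6]) // 13 characters
	return name[:maxSetNameLen-len(suffix)] + suffix
}

func main() {
	fmt.Println(sketchSetName("front", "default"))
	fmt.Println(sketchSetName("my-app", "web.tier"))
}
```

Hashing the unsanitized inputs rather than the sanitized name keeps two scopes apart even when sanitization maps them to the same string prefix.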
chain forward
type filter hook forward priority 0; policy accept;
Rules in order:
1. ct state established,related accept: return path for already-allowed flows.
2. Per (deployment, network), one rule: ip saddr @<set> ip daddr @<set> accept. This admits same-(deployment, network) traffic, anywhere in the cluster, regardless of whether the inbound interface is a JACO bridge or wg-jaco.
3. ip saddr @jaco_pool ip daddr @jaco_pool drop: the cross-scope isolation drop, emitted only when subnets exist. Two JACO containers in different scopes both fall in jaco_pool but match no per-set accept, so this rule fires. Anything where either address is outside jaco_pool falls through to the accept policy untouched.
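The rule order above can be modelled in a few lines of Go. This is a simulation for reasoning about verdicts, not code from JACO: scopeSet, newScope, and forwardVerdict are hypothetical names, the pool is derived as the union of all scope subnets, and the conntrack rule is left out, so the function describes the verdict for the first packet of a new flow only.

```go
package main

import (
	"fmt"
	"net/netip"
)

// scopeSet models one dep_net_* set: a name plus its member subnets.
type scopeSet struct {
	name    string
	subnets []netip.Prefix
}

func newScope(name string, cidrs ...string) scopeSet {
	s := scopeSet{name: name}
	for _, c := range cidrs {
		s.subnets = append(s.subnets, netip.MustParsePrefix(c))
	}
	return s
}

func (s scopeSet) contains(a netip.Addr) bool {
	for _, p := range s.subnets {
		if p.Contains(a) {
			return true
		}
	}
	return false
}

// forwardVerdict walks the forward chain in rule order for a new flow.
func forwardVerdict(scopes []scopeSet, src, dst string) (string, error) {
	sa, err := netip.ParseAddr(src)
	if err != nil {
		return "", fmt.Errorf("source address: %w", err)
	}
	da, err := netip.ParseAddr(dst)
	if err != nil {
		return "", fmt.Errorf("destination address: %w", err)
	}
	inPool := func(a netip.Addr) bool {
		for _, sc := range scopes {
			if sc.contains(a) {
				return true
			}
		}
		return false
	}
	// Per-scope accepts: both addresses must be in the same set.
	for _, sc := range scopes {
		if sc.contains(sa) && sc.contains(da) {
			return "accept (" + sc.name + ")", nil
		}
	}
	// Scoped drop: both addresses are JACO's, but no single scope holds both.
	if inPool(sa) && inPool(da) {
		return "drop", nil
	}
	// Anything touching a non-JACO address falls through to the chain policy.
	return "accept (policy)", nil
}

func main() {
	scopes := []scopeSet{
		newScope("dep_net_front__default", "10.244.1.0/24"),
		newScope("dep_net_back__default", "10.244.7.0/24"),
	}
	for _, pair := range [][2]string{
		{"10.244.1.5", "10.244.1.9"},
		{"10.244.1.5", "10.244.7.9"},
		{"10.244.1.5", "192.168.1.10"},
	} {
		v, err := forwardVerdict(scopes, pair[0], pair[1])
		fmt.Println(pair[0], "->", pair[1], ":", v, err)
	}
}
```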
chain input
type filter hook input priority 0; policy accept;
No rules. JACO does not police host ingress. WireGuard
(udp/51820), the gRPC API (tcp/7000), the per-bridge DNS responder
(udp/53), and the public Caddy ports (tcp/80,443) are all reachable
because the chain accepts by default — JACO does not add explicit
allows for them. The chain exists only so the table's shape is stable
for drift detection.
chain output
type filter hook output priority 0; policy accept;
JACO does not constrain egress from the host itself.
Why IP sets (not interface matching)
Cross-node traffic arrives via wg-jaco, not a JACO bridge — iifname
matching alone would miss it. IP-set matching, keyed on
(saddr, daddr) both being members of the same
dep_net_<dep>_<net> set, works uniformly for same-node bridge-to-bridge
and for cross-node WG-decrypted paths.
Coexistence with docker
JACO leaves docker's iptables management enabled — it handles NAT for
outbound container traffic and per-bridge defaults; replacing it is out
of scope. JACO's rules live in a separate nftables table; the
scoped jaco_pool cross-set drop composes with docker's parallel
chains because any drop from any chain stops the packet.
Because JACO's chains are all-accept, cross-host container traffic must also survive docker's own isolation drops. The firewall reconciler re-asserts two exemptions in docker-owned chains on every tick (issue #28), gated on the IPAM pool being known:
- EnsureSNAT: an intra-pool SNAT exemption in docker's nat POSTROUTING so pool-to-pool traffic keeps its real source address. A failure audits SNAT_EXEMPT_FAILED.
- EnsureOverlay: intra-pool ACCEPT exemptions (raw PREROUTING direct-routing + DOCKER-USER inter-network isolation) so WG-decrypted cross-host packets aren't dropped by docker's container-isolation rules. A failure audits OVERLAY_EXEMPT_FAILED.
These live outside table inet jaco (and outside its self-test), so
they are re-checked best-effort every 30 s independently of the
isolation status.
Atomic reload
JACO renders the full expected ruleset for the node on every
relevant watch event (Subnets, ReplicaObserved, Nodes), debounced at
200 ms, and submits the whole thing as one transaction via nft -f.
Partial state is impossible — either the new ruleset is in place or
the old one still is.
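Debouncing a burst of watch events into one render can be sketched as follows. This is an illustrative pattern, not JACO's implementation: debounce, its channel-based signature, and the flush-on-close behavior are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// debounce calls fire once each time events has been quiet for window.
// When events is closed it flushes a pending burst, then returns.
func debounce(events <-chan struct{}, window time.Duration, fire func()) {
	var timer *time.Timer
	var expired <-chan time.Time // nil (never ready) while nothing is pending
	for {
		select {
		case _, ok := <-events:
			if !ok {
				if expired != nil {
					timer.Stop()
					fire() // do not lose the last burst on shutdown
				}
				return
			}
			if timer != nil {
				timer.Stop()
			}
			// A fresh timer per event avoids Reset/drain subtleties.
			timer = time.NewTimer(window)
			expired = timer.C
		case <-expired:
			expired = nil
			fire()
		}
	}
}

func main() {
	events := make(chan struct{})
	go func() {
		// Three watch events in quick succession, then silence.
		for i := 0; i < 3; i++ {
			events <- struct{}{}
		}
		time.Sleep(300 * time.Millisecond)
		close(events)
	}()
	debounce(events, 200*time.Millisecond, func() {
		fmt.Println("render full ruleset and submit one nft -f transaction")
	})
}
```

Each call to fire would render the whole ruleset rather than a delta, which is what lets the apply remain a single all-or-nothing transaction.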
The rendered file leads with add table inet jaco then
delete table inet jaco before the table inet jaco { … } body. This
matters because nft -f appends to an existing chain rather than
replacing it: re-applying the table body on every reconcile would stack
a fresh generation of forward rules onto the live chain each time. An
earlier generation's @jaco_pool … drop then sits ahead of a
later-deployed stack's per-scope accept, shadowing it — silently
breaking cross-host traffic for every deployment except the first
applied (same-host traffic L2-switches within one bridge and never hits
the forward hook, which is what made the bug subtle; issue #89).
Deleting the table first means each apply recreates it from scratch
within the one nft -f transaction, so the chains never accumulate. The
add ahead of the delete keeps the delete from failing on a cold
host where the table doesn't exist yet. The SNAT/overlay exemptions live
in docker's own nat/raw tables (re-asserted each tick), so deleting and
recreating inet jaco doesn't disturb them.
Self-test on startup
After first load, JACO reads back nft -j list table inet jaco and
checks it against the rendered expectation (SelfTestFromJSON): the
three base chains are present with policy accept, one
dep_net_<dep>_<net> set exists per scope, and jaco_pool exists iff
there is at least one subnet. On mismatch:
1. Error{code: isolation_self_test_failed} is logged + audited.
2. The daemon does NOT call sd_notify(READY=1); systemd holds the unit in starting.
3. Node.status in raft stays joining, then transitions to isolation_unavailable once admission opens; other nodes see this and skip the host for scheduling.
Operator action: confirm the kernel has nftables, the nft binary is
on PATH, and the daemon has CAP_NET_ADMIN. Then systemctl restart jacod to retry.
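The checks listed for the self-test can be sketched against nft's JSON listing. The code below is an assumption-laden illustration, not SelfTestFromJSON itself: selfTest and its types are hypothetical, and only the name and policy fields of the chain and set objects in the nftables JSON array are decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nftObject decodes only the parts of each listing entry the check needs.
type nftObject struct {
	Chain *struct {
		Name   string `json:"name"`
		Policy string `json:"policy"`
	} `json:"chain"`
	Set *struct {
		Name string `json:"name"`
	} `json:"set"`
}

type nftListing struct {
	Nftables []nftObject `json:"nftables"`
}

// selfTest verifies the three base chains, the per-scope sets, and that
// jaco_pool exists if and only if at least one scope set is expected.
func selfTest(raw []byte, wantSets []string) error {
	var listing nftListing
	if err := json.Unmarshal(raw, &listing); err != nil {
		return fmt.Errorf("decode nft json: %w", err)
	}
	policies := map[string]string{}
	sets := map[string]bool{}
	for _, obj := range listing.Nftables {
		if obj.Chain != nil {
			policies[obj.Chain.Name] = obj.Chain.Policy
		}
		if obj.Set != nil {
			sets[obj.Set.Name] = true
		}
	}
	for _, chain := range []string{"forward", "input", "output"} {
		if policies[chain] != "accept" {
			return fmt.Errorf("chain %s: policy %q, want accept", chain, policies[chain])
		}
	}
	for _, name := range wantSets {
		if !sets[name] {
			return fmt.Errorf("missing set %s", name)
		}
	}
	if sets["jaco_pool"] != (len(wantSets) > 0) {
		return fmt.Errorf("jaco_pool present=%v, want %v", sets["jaco_pool"], len(wantSets) > 0)
	}
	return nil
}

// sampleListing is a trimmed, hand-written example of the listing shape.
const sampleListing = `{"nftables":[
 {"metainfo":{"json_schema_version":1}},
 {"table":{"family":"inet","name":"jaco"}},
 {"set":{"family":"inet","table":"jaco","name":"dep_net_front__default","type":"ipv4_addr"}},
 {"set":{"family":"inet","table":"jaco","name":"jaco_pool","type":"ipv4_addr"}},
 {"chain":{"family":"inet","table":"jaco","name":"forward","hook":"forward","policy":"accept"}},
 {"chain":{"family":"inet","table":"jaco","name":"input","hook":"input","policy":"accept"}},
 {"chain":{"family":"inet","table":"jaco","name":"output","hook":"output","policy":"accept"}}
]}`

func main() {
	err := selfTest([]byte(sampleListing), []string{"dep_net_front__default"})
	fmt.Println("self-test error:", err)
}
```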
Drift reconcile
A safety tick runs every 30 s:
1. nft list table inet jaco and diff against the expected rendered ruleset.
2. On any drift (operator manually edited, another tool clobbered the table, transient kernel issue): re-render, nft -f atomic reload, emit an AuditEvent{type: isolation_ruleset_reconciled, identity: system, payload: {node, diff}}.
3. If reload fails: transition the node to isolation_unavailable; cease accepting new container creates. Existing containers continue running; JACO does not amplify damage by tearing them down on drift recovery failure.
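The three steps of the safety tick can be sketched with injected dependencies. driftDeps, driftTick, and errDemo are hypothetical names; the sketch compares the live and expected rulesets as already-normalized strings and treats a failed listing as drift, both of which are simplifying assumptions about how the real reconciler diffs.

```go
package main

import (
	"errors"
	"fmt"
)

// driftDeps bundles the side effects so the tick logic stays testable.
type driftDeps struct {
	ListLive  func() (string, error) // e.g. normalized `nft list table inet jaco`
	Expected  func() string          // normalized rendered expectation
	Reload    func() error           // re-render + atomic nft -f
	Audit     func(eventType string)
	SetStatus func(status string)
}

// driftTick runs one safety tick: compare, reload on drift, audit or degrade.
func driftTick(d driftDeps) error {
	live, err := d.ListLive()
	if err == nil && live == d.Expected() {
		return nil // no drift, nothing to do
	}
	if err := d.Reload(); err != nil {
		// Existing containers are left alone; the node only stops taking new ones.
		d.SetStatus("isolation_unavailable")
		return fmt.Errorf("reload after drift: %w", err)
	}
	d.Audit("isolation_ruleset_reconciled")
	return nil
}

// errDemo is a placeholder failure used only by the demo below.
var errDemo = errors.New("nft reload failed")

func main() {
	live, want := "table inet jaco { }", "table inet jaco { chain forward { } }"
	d := driftDeps{
		ListLive:  func() (string, error) { return live, nil },
		Expected:  func() string { return want },
		Reload:    func() error { live = want; return nil },
		Audit:     func(t string) { fmt.Println("audit:", t) },
		SetStatus: func(s string) { fmt.Println("status:", s) },
	}
	fmt.Println("tick with drift:", driftTick(d))
	fmt.Println("tick without drift:", driftTick(d))

	live = "tampered again"
	d.Reload = func() error { return errDemo }
	fmt.Println("tick with failing reload:", driftTick(d))
}
```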
Leader-forwarded audit and status (issues #88, #112, #113)
The Audit and UpdateStatus callbacks both write to raft. Direct
node.Apply only succeeds on the leader; on a follower it returns
hraft.ErrNotLeader. The reconciler routes both callbacks through the
applyOrForwardCommand shim:
- on the leader → direct node.Apply;
- on a follower → dial the leader's gRPC address (resolved from state.Nodes) and call Internal.Submit to apply the same command cluster-wide.
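The routing decision can be sketched as below. The interfaces and fakes are hypothetical stand-ins: raftNode abstracts the local raft node, leaderClient abstracts the gRPC client whose Submit corresponds to Internal.Submit, and errNotLeader plays the role of hraft.ErrNotLeader. The real applyOrForwardCommand signature is not shown in this document and may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotLeader stands in for hraft.ErrNotLeader.
var errNotLeader = errors.New("node is not the leader")

type raftNode interface {
	IsLeader() bool
	Apply(cmd []byte) error
}

type leaderClient interface {
	Submit(cmd []byte) error
}

// applyOrForward applies locally on the leader and forwards otherwise.
func applyOrForward(node raftNode, dialLeader func() (leaderClient, error), cmd []byte) error {
	if node.IsLeader() {
		return node.Apply(cmd)
	}
	client, err := dialLeader()
	if err != nil {
		return fmt.Errorf("resolve leader: %w", err)
	}
	return client.Submit(cmd)
}

// fakeNode and fakeLeader exist only to exercise the sketch.
type fakeNode struct {
	leader  bool
	applied int
}

func (n *fakeNode) IsLeader() bool { return n.leader }
func (n *fakeNode) Apply([]byte) error {
	if !n.leader {
		return errNotLeader
	}
	n.applied++
	return nil
}

type fakeLeader struct{ submitted int }

func (l *fakeLeader) Submit([]byte) error { l.submitted++; return nil }

func main() {
	leader := &fakeLeader{}
	dial := func() (leaderClient, error) { return leader, nil }

	follower := &fakeNode{leader: false}
	fmt.Println("follower:", applyOrForward(follower, dial, []byte("audit")))
	fmt.Println("forwarded commands:", leader.submitted)

	self := &fakeNode{leader: true}
	fmt.Println("leader:", applyOrForward(self, dial, []byte("audit")))
	fmt.Println("locally applied commands:", self.applied)
}
```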
Before this shim, every follower's reconcile audited
ISOLATION_RULESET_RECONCILED and called UpdateStatus, both failed
with ErrNotLeader, and the reconciler logged
Audit(...) failed + firewall.Reconciler.Tick failed even though
the underlying nft -f apply had succeeded. The audit event was also
lost.
A freshly-joined follower's first tick can still race ahead of raft
leader discovery — at that point state.Nodes carries no leader gRPC
address and the forward fails. To suppress the spurious startup-window
errors, Reconciler.ReadyGate is wired to node.Leader() != "". While
the gate returns false, Loop skips Tick and waits for the next
ticker. Steady-state behavior is unchanged once raft settles.
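A gate of this kind can be sketched in a few lines. The loop function and its channel-driven signature are hypothetical, chosen so the example runs without timers; the real Loop and Reconciler.ReadyGate are not shown in this document. The leader address in the demo is made up.

```go
package main

import "fmt"

// loop runs tick for every ticker signal, skipping signals that arrive
// while readyGate reports false. A nil gate means "always ready".
func loop(ticks <-chan struct{}, readyGate func() bool, tick func()) {
	for range ticks {
		if readyGate != nil && !readyGate() {
			continue // leader unknown: skip and wait for the next signal
		}
		tick()
	}
}

func main() {
	leader := "" // what node.Leader() would report in the daemon
	polls := 0
	ready := func() bool {
		polls++
		if polls == 3 {
			leader = "node-a:7000" // hypothetical: raft has settled
		}
		return leader != ""
	}

	ticks := make(chan struct{}, 4)
	for i := 0; i < 4; i++ {
		ticks <- struct{}{}
	}
	close(ticks)

	loop(ticks, ready, func() { fmt.Println("tick ran, leader =", leader) })
}
```

Skipping rather than blocking keeps the loop responsive: a tick that arrives before leader discovery costs nothing, and the next one retries the gate.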
The isolation rig (scripts/test/isolation-rig.sh) exercises this in
CI: it flushes the table out-of-band and asserts that the reconcile
restores it within 30 s and that the audit event is recorded.
What containers can and cannot do
A container in deployment front (subnet 10.244.1.0/24) attempting
to reach a container in deployment back (subnet 10.244.7.0/24):
- DNS: back.some-service returns NXDOMAIN immediately (the responder on front's bridge only knows front's services).
- By guessed IP: packet enters FORWARD; no per-scope set contains both 10.244.1.x and 10.244.7.x, so no per-set ACCEPT matches; both addresses are in jaco_pool, so the scoped cross-set drop fires. Same on cross-node attempts.
Multi-network within one deployment behaves identically: the named
sets are per-(deployment, network), not per-deployment, so a frontend
container cannot reach a backend-only service even within the same
deployment unless a bridge service (declared on both networks)
relays.
Compose namespace knobs weaken isolation
Compose accepts a closed allowlist of namespace knobs that JACO
forwards verbatim into docker's HostConfig (issue #118): ipc,
pid, uts, userns_mode, cgroup, cgroup_parent. Host-mode
values (pid: host, ipc: host, uts: host, userns_mode: host)
share the host kernel's namespace with the container and weaken
isolation by design. JACO does not gate them at apply time — an
operator declaring pid: host is presumed to know they are giving the
container visibility into every host process.
The adjacent fields privileged and security_opt are gated
(issue #119). Both require the service to carry
labels: { "jaco.io/allow-privileged": "true" } AND the calling
operator's token to carry allows_privileged=true. See
Auth and tokens → allows_privileged flag
and Supported compose fields → Privileged services.
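The two-sided gate can be expressed as a small check. The service struct and checkPrivilegeGate are hypothetical names for illustration; only the label key, the label value, and the allows_privileged token flag come from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

const allowPrivilegedLabel = "jaco.io/allow-privileged"

// service carries just the compose fields relevant to the gate.
type service struct {
	Privileged  bool
	SecurityOpt []string
	Labels      map[string]string
}

// checkPrivilegeGate requires BOTH the service label and the token flag
// whenever the service uses privileged or security_opt.
func checkPrivilegeGate(svc service, tokenAllowsPrivileged bool) error {
	if !svc.Privileged && len(svc.SecurityOpt) == 0 {
		return nil // gate does not apply to this service
	}
	if svc.Labels[allowPrivilegedLabel] != "true" {
		return errors.New("service lacks label " + allowPrivilegedLabel + `: "true"`)
	}
	if !tokenAllowsPrivileged {
		return errors.New("operator token lacks allows_privileged=true")
	}
	return nil
}

func main() {
	svc := service{
		Privileged: true,
		Labels:     map[string]string{allowPrivilegedLabel: "true"},
	}
	fmt.Println("label + token:", checkPrivilegeGate(svc, true))
	fmt.Println("label only:", checkPrivilegeGate(svc, false))
}
```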
network_mode is also restricted (issue #121): empty / none /
service:<name> only. host, bridge, container:<id>, and any
named-network value are rejected outright — they would bypass the
per-deployment bridge, the WireGuard mesh, this ruleset, and ingress
all at once.
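The accepted values can be captured in a short validator. validateNetworkMode is a hypothetical name, and rejecting an empty service name (a bare "service:") is an assumption the text does not spell out.

```go
package main

import (
	"fmt"
	"strings"
)

// validateNetworkMode accepts only empty, "none", and "service:<name>".
func validateNetworkMode(mode string) error {
	const servicePrefix = "service:"
	switch {
	case mode == "", mode == "none":
		return nil
	case strings.HasPrefix(mode, servicePrefix) && len(mode) > len(servicePrefix):
		return nil
	}
	// host, bridge, container:<id>, and named networks all land here: they
	// would sidestep the per-deployment bridge and the isolation ruleset.
	return fmt.Errorf("network_mode %q is not supported (allowed: empty, none, service:<name>)", mode)
}

func main() {
	for _, mode := range []string{"", "none", "service:db", "host", "bridge", "container:abc123", "backend"} {
		fmt.Printf("%q -> %v\n", mode, validateNetworkMode(mode))
	}
}
```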
The bridge / nftables isolation described above still holds: a
container with pid: host sees the host's processes but its network
traffic is still subject to the per-scope set match in chain forward.
Practical implications
- Operator hygiene: do not hand-edit inet jaco. JACO will reconcile within 30 s, emit an audit event, and you will have to explain it.
- CI / dev clusters: the rig requires CAP_NET_ADMIN, CAP_NET_RAW, kernel WG, nftables, and docker. make test-isolation runs the rig; make ci-test skips it. See Testing.
- Production: every node MUST satisfy the kernel gates. A partially gated cluster is supported (the affected node sits in isolation_unavailable, others schedule normally), but operators should fix the gated node before treating the cluster as production-healthy.
- Host-kernel surface: compose devices: (issue #115), privileged:, and host-mode namespace knobs (pid: host, ipc: host, uts: host, userns_mode: host) all weaken isolation by design. JACO honors them as-written today (privileged: only once the label and token gate described above is satisfied); an operator-side policy gate (label/selector based) is on the roadmap so deployments that need raw host access can be opted in explicitly per node, without forcing every workload through the same surface.
See also
- Networking
- Status and errors: the isolation_unavailable and isolation_ruleset_reconciled codes
- Troubleshooting