ADR 0002: Pressure-based scheduling
- Status: accepted (simplified 2026-05-31)
- Date: 2026-05-30
- Issue: #92
- Supersedes: the initial draft of this ADR, which proposed a quorum-aware, 4-dimensional, operator-gated rebalancer with stateful-volume migration hooks. That scope was abandoned; this revision is what actually ships.
A leader-driven, hysteretic, always-on rebalancer that observes per-node CPU + memory pressure, picks the cheapest stateless replica to move off the hottest node when the cluster is meaningfully imbalanced, and uses the existing scheduler move path to relocate it.
Design priority: conservative inaction. JACO targets clusters of a handful of nodes; on a 3–5 node cluster, the operator manually draining a hot node is almost always the right answer, and an aggressive rebalancer is more likely to flap workloads than to relieve real pressure. So the rebalancer runs by default, but the gates are tight enough that on an idle or uniformly busy cluster it never moves anything.
Stateful workloads are out of scope entirely. The rebalancer never moves a replica that owns data — JACO has no way to re-attach that data on a different host, and the orchestrator-side byte-copy alternative (issue #91) was rejected on its own merits. Stateful here means "has a bind mount or docker named volume that holds state the workload needs across restarts" — the rebalancer simply never enumerates those replicas as candidates because their service spec carries the volume reference. There is no "stateful filter" in code; the scorer is stateless-only by construction.
Signals
Composite node pressure computed once per cycle (30 s) per node:
    pressure(node) = max(
        ewma(cpu_util, window=5m),
        ewma(memory_util, window=5m),
    )

EWMA, not instantaneous: instantaneous CPU spikes are noise and would cause thrashing. A window of 5 minutes is loose enough to ignore short-lived peaks and tight enough to react within a few minutes to a sustained imbalance.
Two dimensions, not four. Disk-io and replica-count were in the original ADR; both were cut. Disk-io has no collector wired and a small-cluster operator hitting disk-io saturation is almost always also hitting CPU or memory pressure. Replica-count-vs-soft-cap is a knob nobody on a handful of nodes will ever tune. If a third dimension genuinely earns its keep later (e.g. per-replica network bytes), grow the struct then.
Data source: a PressureSource interface. The daemon currently wires a
NoopSource that returns "no data" for every node, so the rebalancer is
effectively dormant until a real cgroup v2 collector lands as a follow-up
(see #137). The follow-up is the only thing keeping the rebalancer from
acting today; all the decision logic, gating, and audit plumbing are live.
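The seam described above might look roughly like the following. `PressureSource` and `NoopSource` are named in this ADR; the `Reading` type, the `Read` method signature, and the `readable` helper are hypothetical and only illustrate how "no data" keeps every gate closed.

```go
package main

// Reading is a hypothetical raw utilization sample for one node,
// with each field expressed as a fraction in [0, 1].
type Reading struct {
	CPUUtil    float64
	MemoryUtil float64
}

// PressureSource is the collector seam. The method shape is an assumption.
type PressureSource interface {
	// Read returns the latest reading for nodeID; ok is false when the
	// source has no data for that node.
	Read(nodeID string) (r Reading, ok bool)
}

// NoopSource reports "no data" for every node.
type NoopSource struct{}

func (NoopSource) Read(string) (Reading, bool) { return Reading{}, false }

// Compile-time check that NoopSource satisfies the interface.
var _ PressureSource = NoopSource{}

// readable returns readings only for nodes that reported data. With a
// NoopSource the result is always empty, so no node can ever look hot.
func readable(src PressureSource, nodeIDs []string) map[string]Reading {
	out := make(map[string]Reading)
	for _, id := range nodeIDs {
		if r, ok := src.Read(id); ok {
			out[id] = r
		}
	}
	return out
}
```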
Thresholds and hysteresis
The rebalancer commits a move only when all of these hold:
- max(pressure) >= 0.85 for at least 2 consecutive cycles (~1 minute).
- max(pressure) - min(pressure) >= 0.25 across nodes (cluster is imbalanced, not uniformly busy).
- post_move_pressure(src) <= max_pressure - 0.10: meaningful relief.
- post_move_pressure(dst) < 0.75: does not create a new hotspot.
- The candidate replica has been on src for at least cooldown_replica = 10m.
- The destination has not received a move in the last cooldown_node = 2m.
These are the defaults in internal/scheduler/rebalance/config.Config and
they are baked in. There is no operator-facing config block: the rebalancer
is on-by-default with no knobs. Tests inject a Config to drive
deterministic cycles; production constructs DefaultConfig() once at boot.
Selection policy
Among replicas on the most-pressured node, the cheapest move wins:
    score(replica) = relief_estimate(replica) - move_cost

where:
- relief_estimate = the replica's contribution to src's dominant pressure dimension (CPU-dominant hotspot → replica CPU footprint; memory-dominant → replica RSS).
- move_cost = 0.01, a small fixed restart penalty so ties broken by relief alone resolve deterministically.
No stateless/stateful bonus, no priority weighting. The scorer assumes
stateless candidates exclusively; non-stateless replicas are excluded
upstream by virtue of how the rebalancer enumerates candidates from
ReplicasDesired.
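The selection rule can be sketched as below, reading "cheapest move wins" as "highest score wins". Only the score formula and the 0.01 move cost come from this ADR; the `Candidate` fields, expressing footprints as fractions of the source node, and breaking exact ties by lexicographic replica ID are assumptions made for the example.

```go
package main

// Dimension identifies which pressure dimension dominates on src.
type Dimension int

const (
	CPU Dimension = iota
	Memory
)

// moveCost is the fixed restart penalty from the ADR.
const moveCost = 0.01

// Candidate is a hypothetical view of one stateless replica on src.
// Shares are fractions of the source node's capacity.
type Candidate struct {
	ReplicaID string
	CPUShare  float64
	MemShare  float64
}

// dominant picks the dimension that drives src's composite pressure.
func dominant(cpuPressure, memPressure float64) Dimension {
	if memPressure > cpuPressure {
		return Memory
	}
	return CPU
}

// score is relief on the dominant dimension minus the fixed move cost.
func score(c Candidate, dom Dimension) float64 {
	relief := c.CPUShare
	if dom == Memory {
		relief = c.MemShare
	}
	return relief - moveCost
}

// pickBest returns the highest-scoring candidate. Exact ties fall back to
// the smaller ReplicaID (an assumed rule, chosen only for determinism).
func pickBest(cands []Candidate, dom Dimension) (best Candidate, ok bool) {
	for _, c := range cands {
		s := score(c, dom)
		switch {
		case !ok, s > score(best, dom):
			best, ok = c, true
		case s == score(best, dom) && c.ReplicaID < best.ReplicaID:
			best = c
		}
	}
	return best, ok
}
```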
Constraints the scorer hard-filters before scoring:
- never violates resource limits at dst (post_cpu, post_mem both ≤ 1.0)
- never violates anti-affinity (placement: spread / placement: hosts:; placement: global is never moved by definition).
A replica that fails any hard filter on every candidate dst is not
movable, full stop — the rebalancer logs a SkipNoEligibleDst and moves
on. Not an error.
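One way the hard filter could be shaped is sketched below. The two constraints and the `resource_limits` / `anti_affinity` reason strings come from this ADR; the `Replica` and `Node` types, the interpretation of placement: hosts: as an allow-list, and reporting global placement as an anti-affinity rejection are assumptions.

```go
package main

// PlacementMode mirrors the three placement kinds mentioned in the ADR.
type PlacementMode int

const (
	PlacementSpread PlacementMode = iota
	PlacementHosts
	PlacementGlobal
)

// Replica is a hypothetical view of a move candidate.
type Replica struct {
	Service      string
	Mode         PlacementMode
	AllowedHosts []string // only meaningful for PlacementHosts
	CPU, Mem     float64  // footprint as a fraction of one node
}

// Node is a hypothetical view of a candidate destination.
type Node struct {
	ID       string
	CPU, Mem float64         // current utilization in [0, 1]
	Services map[string]bool // services that already have a replica here
}

// hardFilter returns "" when dst is acceptable, otherwise the skip reason.
func hardFilter(r Replica, dst Node) string {
	if dst.CPU+r.CPU > 1.0 || dst.Mem+r.Mem > 1.0 {
		return "resource_limits"
	}
	switch r.Mode {
	case PlacementGlobal:
		// Global replicas are never moved.
		return "anti_affinity"
	case PlacementSpread:
		if dst.Services[r.Service] {
			return "anti_affinity"
		}
	case PlacementHosts:
		for _, h := range r.AllowedHosts {
			if h == dst.ID {
				return ""
			}
		}
		return "anti_affinity"
	}
	return ""
}

// eligibleDsts lists destinations that pass every hard filter. An empty
// result corresponds to the SkipNoEligibleDst case.
func eligibleDsts(r Replica, nodes []Node) []string {
	var out []string
	for _, n := range nodes {
		if hardFilter(r, n) == "" {
			out = append(out, n.ID)
		}
	}
	return out
}
```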
Quorum modeling was in the original ADR; it was cut. The SPREAD anti-affinity gate already prevents the rebalancer from co-locating two replicas of the same service, which is the actual failure mode quorum modeling was trying to prevent for stateless raft-shaped workloads. A genuine quorum-bearing workload is stateful and therefore not a candidate.
Move execution
Standard reschedule command through the existing internal/scheduler path.
The reconciler does stop-on-src / start-on-dst; the rebalancer just chose
the move.
Concurrency cap: at most one move per cycle, cluster-wide. Defends against avalanche — one move lands, next cycle re-evaluates. Slower convergence, no thrash.
Decision authority
The rebalancer runs on every node but Cycle self-gates on
LeaderStatus.IsLeader(). State it needs (pressure EWMAs, cooldown
timestamps, consecutive-over counters) is leader-local — not in raft,
because losing it on a failover is fine: the new leader rebuilds it from
the next two cycles of pressure samples.
The decision itself (the move command) goes through raft like any other
placement change, so jaco status reflects committed moves.
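The leader gate and the rebuild-after-failover behaviour can be illustrated as follows. `Cycle` and `LeaderStatus.IsLeader()` are named in this ADR; the `Rebalancer` fields, the `staticLeader` test double, the `Cycle` signature, and resetting the counter while not leader are assumptions made for this sketch.

```go
package main

// LeaderStatus is the leadership probe the cycle consults.
type LeaderStatus interface{ IsLeader() bool }

// staticLeader is a hypothetical fixed-answer implementation for examples.
type staticLeader bool

func (s staticLeader) IsLeader() bool { return bool(s) }

// Rebalancer holds only leader-local state; none of it is replicated.
type Rebalancer struct {
	leader          LeaderStatus
	hotThreshold    float64
	needCycles      int
	consecutiveOver int
}

// Cycle reports whether the sustained-pressure gate is open this cycle.
// Non-leaders do nothing and keep no history, so a newly elected leader
// needs needCycles fresh samples before it can act.
func (r *Rebalancer) Cycle(maxPressure float64) bool {
	if !r.leader.IsLeader() {
		r.consecutiveOver = 0
		return false
	}
	if maxPressure >= r.hotThreshold {
		r.consecutiveOver++
	} else {
		r.consecutiveOver = 0
	}
	return r.consecutiveOver >= r.needCycles
}
```

A node that has watched a hot cluster as a follower for several cycles still returns false on its first cycle as leader and true on its second, which matches the "rebuilds from the next two cycles" behaviour described above.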
Observability
Audit log records every committed move (AUDIT_EVENT_TYPE_REBALANCE_MOVED)
and every per-candidate skip
(AUDIT_EVENT_TYPE_REBALANCE_SKIPPED), with the same payload shape:
    replica_id, deployment, service, src, dst,
    dominant (cpu|memory),
    relief, score, move_cost,
    src_pressure_before, dst_pressure_before,
    src_pressure_after, dst_pressure_after,
    reason (only on SKIPPED: cooldown_replica | cooldown_node |
            dst_cap | relief_floor | resource_limits | anti_affinity |
            no_eligible_dst | no_candidate)

The AUDIT_EVENT_TYPE_REBALANCE_DRY_RUN tag (proto field 22) is reserved
but no longer emitted: the dry-run mode was removed when the rebalancer
became always-on.
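For readers who think in Go types, the shared MOVED/SKIPPED payload could be mirrored by a struct like the one below. The field list and the reason values are taken from the payload shape above; the Go names, the `SkipReason` type, and the `valid` helper are hypothetical and do not describe the actual proto-generated code.

```go
package main

// SkipReason enumerates the reasons listed for SKIPPED events.
type SkipReason string

const (
	ReasonCooldownReplica SkipReason = "cooldown_replica"
	ReasonCooldownNode    SkipReason = "cooldown_node"
	ReasonDstCap          SkipReason = "dst_cap"
	ReasonReliefFloor     SkipReason = "relief_floor"
	ReasonResourceLimits  SkipReason = "resource_limits"
	ReasonAntiAffinity    SkipReason = "anti_affinity"
	ReasonNoEligibleDst   SkipReason = "no_eligible_dst"
	ReasonNoCandidate     SkipReason = "no_candidate"
)

// RebalanceAudit is a hypothetical Go mirror of the audit payload.
type RebalanceAudit struct {
	ReplicaID, Deployment, Service, Src, Dst string
	Dominant                                 string // "cpu" or "memory"
	Relief, Score, MoveCost                  float64
	SrcPressureBefore, DstPressureBefore     float64
	SrcPressureAfter, DstPressureAfter       float64
	Reason                                   SkipReason // set only on SKIPPED
}

// valid checks the one structural rule stated above: reason is present
// and known on SKIPPED events, and absent on MOVED events.
func (a RebalanceAudit) valid(skipped bool) bool {
	if !skipped {
		return a.Reason == ""
	}
	switch a.Reason {
	case ReasonCooldownReplica, ReasonCooldownNode, ReasonDstCap,
		ReasonReliefFloor, ReasonResourceLimits, ReasonAntiAffinity,
		ReasonNoEligibleDst, ReasonNoCandidate:
		return true
	}
	return false
}
```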
Component changes
- internal/scheduler/rebalance/: the rebalancer package. Cycle loop, pressure aggregation, scorer, hard-filter, audit emission.
- internal/scheduler/placement/: no changes (rebalancer reuses the existing reschedule path via raft-Apply).
- internal/controlplane/fsm/: no schema changes; rebalance commits the same ReplicaDesiredUpsert entry the scheduler already uses.
- internal/daemon/grpc/server.go: always starts the rebalancer goroutine with DefaultConfig() and NoopSource{}. There is no operator config block to wire.
- A real cgroup v2 PressureSource (#137) is the only follow-up work needed to make the rebalancer actually fire in production. Without it the loop spins but every gate short-circuits on "no data for this node".
Tests
- internal/scheduler/rebalance/pressure_test.go: composite math, EWMA decay, EWMA spike-damping, backwards-clock invariance.
- internal/scheduler/rebalance/scorer_test.go: relief estimate, hard filter ordering, anti-affinity per PlacementMode, post-move clamping.
- internal/scheduler/rebalance/hysteresis_test.go: single-spike does NOT trigger; sustained pressure DOES trigger; dst_cap / relief_floor / both cooldowns / imbalance_gap each block when expected; SKIPPED audit carries the right reason.
Acceptance
- A node driven to sustained cpu_util > 0.85 for >1 minute on a 2+ node cluster with at least 0.25 cross-node imbalance triggers a move of an eligible stateless replica to a cooler node, observable in the audit log.
- Under uniform load, no moves happen.
- A move never lands a replica on a dst that would violate its resource limits or anti-affinity (verified by TestHardFilter_OrderingAndReasons).
- The rebalancer subsystem starts on every daemon with no operator config and produces zero audit events on a noop source (verified by the daemon grpc test suite, which boots a daemon without configuring rebalance).
Out of scope
- Stateful workloads. The rebalancer does not move replicas with attached volumes; #91 was rejected and #135 (remote-mounted volumes) is labeled wontfix-candidate.
- Disk-io and replica-count pressure dimensions. CPU + memory only.
- Operator-tunable config knobs. The defaults are baked in.
- Cluster autoscaling and predictive scaling.
- Network-pressure dimension. The current health subsystem doesn't measure per-replica network bytes; if it ever does, add a dimension then.
- Per-deployment opt-out. Operators who want a service immovable use a pinned placement: hosts: or attach a volume.