Benchmarking methodology — fair orchestrator comparison

Reference for how this harness measures JACO vs Kubernetes / k3s / Docker Swarm, and why. The bulk of this document is a synthesized literature/industry review (see Provenance below); the section immediately after this preface maps its findings onto what the harness actually does today and what's still open.

How the harness applies this (status)

Implemented

Repetitions + statistics — run.sh --repeat N takes N measured samples and records every sample plus mean / stdev / 95% CI (load.stats). scorecard.sh prints the rps mean with its ±95% CI; overlapping intervals mean a statistical tie, not a winner. Single-run point estimates are no longer over-read.
Warm-up vs steady state — a discarded warm-up load (BENCH_WARMUP_DURATION, default 20s) precedes the measured samples so caches/JITs are warm.
Load-generator monitoring — the generator host's CPU is sampled around each run; load_generator.cpu_pct_max + a saturated flag (>80%) catch the classic pitfall of mistaking the generator's ceiling for the stack's.
Control-plane overhead proxy — idle node memory + load average are snapshotted post-deploy/pre-load (overhead.*); with an identical idle workload, the cross-stack delta approximates each control plane's fixed footprint.
Identical workload/images/limits, neutral substrate, harmonized HTTPS ingress, re-provision between stacks — already in place (see RUBRIC.md).
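The repetition statistics described above can be sketched as follows. This is a minimal illustration, not the harness's actual load.stats code: the function names, the Student-t interval, and the sample values are assumptions made for the example.

```python
import math
import statistics

# Two-sided 95% Student-t critical values for df = 1..20 (standard table values).
T_975 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
         2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086]


def summarize(samples):
    """Return (mean, stdev, ci_low, ci_high) for a list of per-run rps samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples for an interval")
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation (n - 1)
    df = n - 1
    # Assumption: beyond the table, fall back to the normal approximation.
    t = T_975[df - 1] if df <= len(T_975) else 1.96
    half = t * stdev / math.sqrt(n)
    return mean, stdev, mean - half, mean + half


def intervals_overlap(a, b):
    """True when two (low, high) intervals overlap, i.e. a statistical tie."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Made-up rps samples for two hypothetical stacks.
    a = summarize([1010.0, 990.0, 1005.0, 995.0, 1000.0])
    b = summarize([1020.0, 985.0, 1015.0, 1000.0, 1005.0])
    print(f"A: {a[0]:.1f} rps, 95% CI [{a[2]:.1f}, {a[3]:.1f}]")
    print(f"B: {b[0]:.1f} rps, 95% CI [{b[2]:.1f}, {b[3]:.1f}]")
    print("tie" if intervals_overlap(a[2:], b[2:]) else "distinct")
```

With only a handful of repetitions the t-based interval is noticeably wider than a normal-approximation one, which is why the sketch uses a lookup table rather than a fixed 1.96.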

Open (tracked for the comparative run / future work)

Scheduling latency, scale-out time, and node-failure recovery as first-class measured dimensions (where JACO's raft control plane and the #65/#66 fixes show up).
Cold-vs-warm image-cache as an explicit, symmetric experiment (today TTL and bootstrap time can be contaminated by image-layer cache reuse).
Networking harmonization — ingress terminators still differ per stack (Caddy vs Traefik vs ingress-nginx); documented as a disclosed bias.
One clean 4-way run — all stacks fresh, back-to-back, on one bed, with randomized stack order across repetitions.
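One way to randomize stack order across repetitions while keeping the plan reproducible is a seeded shuffle per repetition. The sketch below is illustrative only; the stack labels and the function name are hypothetical and not part of the harness.

```python
import random

# Illustrative labels for the four stacks under comparison.
STACKS = ["jaco", "kubernetes", "k3s", "swarm"]


def run_order(repetitions, seed):
    """Return one independently shuffled stack order per repetition."""
    rng = random.Random(seed)  # seeded so the plan can be logged and replayed
    plan = []
    for _ in range(repetitions):
        order = STACKS[:]
        rng.shuffle(order)
        plan.append(order)
    return plan


if __name__ == "__main__":
    for i, order in enumerate(run_order(repetitions=3, seed=2026), start=1):
        print(f"repetition {i}: {' -> '.join(order)}")
```

Recording the seed alongside the results lets a later run reproduce the same order, while the shuffle keeps any time-of-day or thermal drift from consistently favoring one stack.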

Provenance

Synthesized via Perplexity sonar-deep-research on 2026-05-26 from 45 academic and industry sources (incl. the COFFEE benchmarking framework, edge-orchestration evaluations, a BlueField-DPU benchmarking thesis, k6 large-test guidance, OpenShift/etcd performance docs, and Docker Swarm overlay-latency reports). It is a methodology reference, not a measurement of our specific stacks — treat its quantitative claims as background, and our own results/ as ground truth.

Methodology and Best Practices for Fair Performance Benchmarking of Kubernetes and Docker Swarm

Modern container orchestration systems such as Kubernetes and Docker Swarm have become critical infrastructure for cloud-native applications, yet quantitatively fair performance comparisons between them remain rare and difficult to conduct correctly. Container orchestration performance is multidimensional, spanning control-plane latency, pod and container startup behavior, throughput and tail latency of user workloads, resource overhead, scaling characteristics, resilience to node failures, networking behavior, and image-pull strategies. At the same time, benchmarking is fraught with threats to validity, including configuration drift, different defaults, cache effects, load-generator saturation, and inadequate statistical treatment of results.[42][33] This report synthesizes academic and industrial knowledge on container orchestration benchmarking. It defines the key metrics and dimensions that matter, analyzes how to design unbiased experiments on identical hardware, and maps tools such as wrk, k6, Fortio, Vegeta, kube-burner, ClusterLoader2, and sysbench to specific measurement goals. It also catalogs common pitfalls that invalidate comparisons and reviews notable research and practitioner studies on Kubernetes versus Docker Swarm and other orchestrators.[2][5][21][15][31][37] The emphasis throughout is on concrete, actionable guidance for building an automated, reproducible benchmark harness capable of delivering fair, statistically defensible comparisons between Kubernetes and Docker Swarm on the same physical substrate.

Container Orchestration and the Challenge of Fair Benchmarking

Conceptual background: what is being benchmarked?

Container orchestration refers to the automated management of deployment, scheduling, scaling, networking, and life-cycle control for containerized applications across a cluster of machines.[27] In practice this means that a control plane continuously reconciles desired state, described in declarative manifests, with actual state in the cluster, by placing containers on nodes, restarting failed instances, scaling replicas, and managing service discovery and networking. Kubernetes is the dominant open-source container orchestration engine, originally inspired by Google’s internal Borg system, and is used to orchestrate fleets of containers representing applications decoupled from underlying machines.[21][27] Docker Swarm, in contrast, is a Docker-native orchestration mode built directly into the Docker Engine, using a Raft-based manager cluster for state replication and providing a simpler, more tightly integrated orchestration model.[21][45] Both platforms allow users to manage containers and scale application deployment, but they differ significantly in control-plane architecture, networking model, ecosystem maturity, and operational complexity.[1][3][21][15]

Because orchestrators sit between applications and infrastructure, their performance cannot be reduced to a single scalar metric. Kubernetes, for example, has a multi-component control plane that includes the API server, etcd as a distributed key–value store, the scheduler, and controllers, along with per-node agents such as kubelet and kube-proxy.[20][29][43] Each component introduces overheads and latencies that propagate to user-visible metrics such as request throughput and tail latency. Docker Swarm’s architecture is simpler but still relies on a Raft consensus group of managers that maintain cluster state and an overlay networking layer that can introduce measurable latency and throughput penalties under high load.[39][45] Benchmarking an orchestrator therefore requires careful decomposition of the end-to-end system into layers: hardware, operating system and container runtime, orchestrator control plane and node agents, cluster networking and service discovery, and finally the application workloads themselves.

The central difficulty in comparing Kubernetes with Docker Swarm on identical hardware is disentangling orchestrator-specific behavior from everything else. Cloud providers impose rate limits, instance variability, and network jitter that can dominate observables if not controlled.[20][29] Caches at many layers—from disk and container image caches to HTTP caches and load balancer session caches—dramatically change performance over time, especially during warm-up.[25][34] Network overlays and service discovery components such as CoreDNS in Kubernetes or IPVS-based virtual IPs in Swarm introduce different overhead profiles that may be sensitive to topology and traffic patterns.[39][44] Moreover, benchmarks inherently exercise both control operations (such as scheduling and scaling) and data-plane operations (such as request handling and packet forwarding), yet many traditional storage and systems benchmarks focus solely on data-path throughput while ignoring control operations.[42] Evaluating an orchestrator demands attention to both.

Academic work on container orchestration benchmarking has begun to formalize this landscape. Systematic literature reviews of container orchestration system (COS) testing identify functionality, resiliency, performance, security, and observability as key testing objectives, and emphasize that performance evaluation must account for both functional correctness and non-functional properties under realistic distributed workloads.[5] Specialized frameworks such as COFFEE have been proposed to systematically benchmark orchestrators like Kubernetes and Nomad across self-hosted and cloud environments using controlled experiments and automated harnesses.[24] Similarly, studies evaluating container orchestration at the network edge highlight the importance of scheduling behavior, dynamic and unstable environments, and latency-sensitive metrics for distributed services.[37] These works provide conceptual guidance but typically do not offer turnkey methodologies for practitioners who need to compare Kubernetes with Docker Swarm on a specific hardware platform.

Industry comparisons of Kubernetes and Docker Swarm often focus on feature sets and operational complexity rather than detailed performance metrics. IBM, Platform9, SUSE, and others describe Kubernetes as more extensible, feature-rich, and suited for large-scale deployments, while Swarm is portrayed as simpler, easier to set up, and sometimes faster to scale small clusters due to a lighter-weight control plane.[1][3][21][15] However, these claims are rarely backed by controlled, reproducible performance experiments on identical hardware. Blogs may mention that Kubernetes has built-in horizontal autoscaling features and more advanced networking, while Swarm emphasizes quick scaling and tight integration with Docker, but they seldom report quantified differences in scheduling latency, throughput under high load, or tail latency distributions for real workloads.[1][3][21][15] As a result, practitioners seeking evidence-based guidance must design their own experiments and benchmarking harnesses.

Designing such a harness is nontrivial. Benchmarking research emphasizes that for results to be comparable and reproducible, workload deployment must be fair and repeatable, and the measurement environment must be tightly controlled.[33][42] To this end, some efforts propose “hermetically sealed” benchmark containers that encapsulate all necessary software, dependencies, and configuration, allowing the same benchmark image to run on different systems with minimal variability in the benchmark stack.[33] This philosophy is directly applicable to comparing Kubernetes and Swarm: one should encapsulate application workloads and load generators in standardized containers, while varying only the orchestration layer and its configuration. The remainder of this report builds on these insights to specify, in detail, what to measure, how to measure it fairly, which tools to use, and how to avoid the common pitfalls that often invalidate orchestrator comparisons.

Key Performance Dimensions and Metrics for Orchestrator Comparison

Throughput and latency of application workloads, including tail latency

For most users, the primary concern is how an orchestrator affects the throughput and latency of their applications. Throughput can be expressed as requests per second (RPS) for HTTP or RPC calls, or as tokens per second (TPS) in the context of large language models, but in all cases it represents the number of work units completed per unit time.[10][11][26][36] Latency is the response time experienced by individual requests, and its distribution—especially the tail—matters greatly for user experience. While averages are often reported, they can be misleading in the presence of skewed distributions and rare but severe outliers. It is therefore standard best practice to report multiple latency percentiles, at minimum the median (p50), high percentiles such as p95, and tail latency such as p99.[26]

P99 latency, the 99th percentile of response times, is particularly important because it answers the question of how slow the slowest one percent of requests are under normal operation.[26] If a service has a p99 latency of 200 ms, ninety-nine percent of requests complete within 200 ms, while the remaining one percent are slower, potentially much slower.[26] P99 is considered a measure of tail latency, capturing systematic problems that affect a small fraction of requests, such as occasional cache misses, lock contention, or infrequent but expensive database queries.[26] Monitoring only the average or median would obscure such issues, whereas high percentiles reveal inconsistency. High-volume systems often report p50, p90, p95, p99, and even p99.9 latencies to fully characterize the response-time distribution.[26] This is directly relevant for orchestrator comparisons: if Kubernetes and Swarm show similar median latencies but Kubernetes has significantly lower p99, then for latency-sensitive workloads Kubernetes may be preferable even if average throughput appears similar.
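To make the gap between averages and tail percentiles concrete, the sketch below computes nearest-rank percentiles over a made-up latency sample. The nearest-rank definition is one common convention; load-testing tools may interpolate differently, so treat the exact method as an assumption.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up sample: 98 fast requests at 10 ms and 2 slow ones at 500 ms.
    latencies_ms = [10.0] * 98 + [500.0] * 2
    print("mean:", statistics.mean(latencies_ms))
    for p in (50, 95, 99):
        print(f"p{p}:", percentile(latencies_ms, p))
```

In this sample the median stays at the fast value while p99 lands on the slow requests, which is the kind of inconsistency that a mean or median alone would hide.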

In load testing of HTTP services, tools such as wrk, Fortio, Vegeta, and k6 typically produce throughput and latency statistics, including percentiles.[8][9][10][11] The wrk tool can generate substantial load on a single multi-core machine using a multithreaded design and event-based I/O, and it reports metrics such as requests per second and latency distributions, which can be used to derive p50, p95, and p99.[8] Fortio is explicitly designed to run at a specified queries-per-second (QPS) rate and records a histogram of execution times, from which it calculates percentiles including p99, and also tracks achieved QPS and error rates.[10] Vegeta similarly generates load at a constant request rate and produces latency histograms and percentiles.[11] Although k6 is oriented toward scripting and integration into CI/CD pipelines, it also reports response-time percentiles and throughput metrics; its documentation emphasizes careful tuning of the load-generator host so that CPU and network resources do not limit the generated load.[9] For an orchestrator comparison, the application under test would be deployed identically on Kubernetes and Swarm, and identical load patterns would be driven from an external load generator, while throughput and p50/p95/p99 latency are measured and compared.

Large language model serving introduces additional metrics such as time-to-first-token (TTFT), inter-token latency (ITL), and end-to-end latency (E2E), but these can be viewed as refinements of general throughput and latency concepts.[36] E2E latency is the total time from request reception to completion of the response, while TTFT is the time until the first output token is produced, and ITL measures the gaps between subsequent tokens.[36] System throughput can be expressed as tokens per second across all concurrent users, and request throughput as requests per second.[36] While these metrics are specialized, they illustrate the need to decompose latency into phases and to monitor both per-request behavior and aggregate throughput under concurrent load. Orchestrator comparisons for LLM workloads would report TTFT and E2E latency percentiles, as well as system TPS and RPS, under identical model and hardware configurations, to assess whether orchestration overheads or networking differences have measurable impact on streaming behavior.[36]

In summary, when comparing Kubernetes to Docker Swarm, a core metric family should include achieved throughput (RPS or QPS) for representative workloads, together with latency distributions from p50 through at least p99. Applications should be instrumented or tested via load-testing tools that can produce detailed histograms, and monitoring tools such as Prometheus or Datadog can be configured to compute and visualize percentile latencies over sliding windows.[26] Care must be taken to ensure that the load generator itself is not the bottleneck, which requires monitoring CPU, memory, and network utilization on the generator host as recommended in the k6 documentation, and selecting instance types with sufficient network bandwidth and CPU capacity.[8][9][10][11]
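A generator-saturation check along the lines of the harness's load_generator.cpu_pct_max and saturated fields could look like the sketch below. The function name and input shape are assumptions; the 80% threshold is the one quoted in the status section.

```python
def generator_report(cpu_samples_pct, threshold_pct=80.0):
    """Summarize generator-host CPU samples taken around one measured run.

    cpu_samples_pct: CPU utilization percentages sampled on the generator host.
    Returns the peak and a flag marking the run as suspect when the peak
    exceeds the threshold.
    """
    if not cpu_samples_pct:
        raise ValueError("no CPU samples recorded for this run")
    cpu_max = max(cpu_samples_pct)
    return {"cpu_pct_max": cpu_max, "saturated": cpu_max > threshold_pct}


if __name__ == "__main__":
    print(generator_report([35.0, 62.5, 71.0]))   # generator had headroom
    print(generator_report([55.0, 91.0, 88.0]))   # result likely reflects the generator
```

A run flagged as saturated is better repeated with a larger or additional generator host than compared against other stacks, since its throughput ceiling belongs to the generator rather than the system under test.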

Control-plane and scheduling latency

Beyond data-plane performance, orchestrator control-plane latency is crucial, particularly in dynamic environments where pods are frequently created, scaled, or rescheduled. In Kubernetes, the kube-scheduler observes unscheduled Pods and decides on target nodes based on resource availability, constraints, and policies.[17][20] The time from when a Pod is first observed by the scheduler to when it is assigned to a node is commonly referred to as scheduling latency. Kubernetes exposes a metric called PodSchedulingDuration that measures this interval, enabling operators and researchers to quantify scheduler performance.[13] Monitoring this metric under varying cluster sizes and workloads reveals how scheduling latency scales, which is central to any fair comparison with Swarm’s scheduling behavior.[13][20]

Control-plane latency is influenced by the performance of etcd, the distributed key–value store that backs Kubernetes cluster state.[43] Etcd’s write-ahead log (WAL) persistence and replication introduce their own latency metrics, such as etcd_disk_wal_fsync_duration_seconds, which captures the latency of fsync operations flushing its log entries to disk, and etcd_network_peer_round_trip_time_seconds, which measures the round-trip time for replicating client requests between etcd members.[29][43] Guidelines from OpenShift and Sysdig emphasize that for healthy large clusters, the 99th percentile of etcd network peer latency should remain below roughly 50 ms, and WAL fsync latencies should be kept low through appropriate storage hardware and IO configuration.[29][43] When benchmarking Kubernetes, one should monitor these etcd metrics to detect whether etcd itself becomes a bottleneck at larger scales or under intense control-plane activity, such as rapid scaling of deployments.
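The etcd latency metrics named above are exported as cumulative histograms, so a percentile has to be estimated from bucket counts. The sketch below does this with linear interpolation inside the matching bucket, in the spirit of Prometheus's histogram_quantile function; the bucket boundaries and counts are made up, and the 50 ms threshold is the guideline figure quoted above.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound,
    with the last bound being +Inf. Interpolates linearly within the bucket
    that contains the target rank; the lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the open-ended bucket
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Made-up peer round-trip-time buckets (seconds, cumulative count).
    peer_rtt = [(0.01, 900), (0.025, 980), (0.05, 995), (0.1, 1000), (math.inf, 1000)]
    p99 = histogram_quantile(0.99, peer_rtt)
    print(f"p99 peer RTT ~ {p99 * 1000:.1f} ms, within 50 ms guideline: {p99 < 0.05}")
```

Because the estimate depends on bucket boundaries, the same bucket layout should be used when comparing runs, and values near a threshold deserve a look at the raw bucket counts.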

Kubernetes control-plane performance also depends on API server throughput and latency, controller responsiveness, and service discovery latencies via CoreDNS. CoreDNS metrics, such as query rate, error rate, and query latency, provide insight into how quickly service discovery lookups are resolved, which affects pod-to-pod communication and initial connection establishment.[44] The Kubernetes latency guide highlights how various factors—including networking, resource contention, and Node pressure—can induce request delays at different layers of the system.[4] Collectively, these control-plane and observability metrics complement application-level throughput and latency, and any orchestrator comparison should incorporate them to avoid misattributing performance limitations.

Docker Swarm’s control plane is built around a Raft consensus module which replicates cluster state across managers, tolerating up to (N-1)/2 failures (rounded down) in a cluster of N managers and requiring a majority quorum of (N/2)+1 managers (with N/2 rounded down) to agree on updates.[45] Raft’s leader election and log replication introduce their own latency characteristics, especially during cluster reconfiguration or when failures occur. While Swarm does not expose as rich a metric set as Kubernetes by default, benchmarking should instrument Swarm managers to measure request processing latency (for example, via Docker APIs), Raft commit latency when available, and the time from service scale-up or creation to task assignment. Fair comparison requires mapping conceptually similar control-plane operations between the two orchestrators: Pod creation and scheduling in Kubernetes versus service deployment and task placement in Swarm.
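The quorum arithmetic is easy to get wrong for even manager counts, so a short worked example may help. The function name is illustrative; the formulas are the ones stated above, read as integer division.

```python
def raft_quorum(managers):
    """Return (quorum, tolerated_failures) for a Raft group of `managers` members."""
    if managers < 1:
        raise ValueError("a Raft group needs at least one member")
    quorum = managers // 2 + 1          # majority needed to commit an update
    tolerated = (managers - 1) // 2     # members that can fail while keeping quorum
    return quorum, tolerated


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 7):
        quorum, tolerated = raft_quorum(n)
        print(f"{n} managers: quorum {quorum}, tolerates {tolerated} failure(s)")
```

Four managers tolerate no more failures than three, which is why odd manager counts are the usual choice and why both orchestrators in a comparison should run the same number of control-plane members.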

Overall, control-plane and scheduling metrics to be compared include Kubernetes PodSchedulingDuration, API server request latency, etcd WAL fsync and peer RTT percentiles, CoreDNS query latency, Swarm manager and Raft commit latencies where accessible, and the time from control-plane operation request (e.g., scaling a deployment) to the point where new containers begin starting on nodes. These metrics reveal how quickly each orchestrator reacts to desired-state changes and how its control plane scales with cluster size and activity.

Pod/container startup latency and time-to-ready

End users and higher-level systems often care about how quickly new instances of an application become ready to serve traffic. In Kubernetes, pod startup latency is defined as the time from Pod creation to readiness, encompassing scheduling, image pulling, container creation, and readiness probe success. Google Kubernetes Engine (GKE) exposes a “Startup Latency” dashboard for workloads that measures total startup latency from the Created status of the Pod until the Pod’s containers become ready, including image pulls.[16] GKE also provides a corresponding node startup latency dashboard focusing on the time from node creation to readiness for scheduling pods, which is relevant when evaluating cluster-autoscaler behavior.[16] These metrics are critical for benchmarking, because they reveal how quickly the orchestrator can respond to sudden load spikes by scaling out.

Startup latency decomposes into several phases: scheduling time, image download or retrieval time, container runtime startup time, and application initialization time. Image-pull behavior is particularly important and can be influenced by configuration. For example, in AWS ECS the ECS_IMAGE_PULL_BEHAVIOR parameter controls whether images are always pulled, pulled only when not present, or preferentially loaded from a cached copy on the instance.[19] Choosing “once” or “prefer-cached” can significantly speed up deployments by avoiding unnecessary image pulls for images already cached on the node.[19] While this parameter is specific to ECS, analogous caching behaviors exist for Docker and Kubernetes, where images present on a node are reused unless explicitly updated. Benchmarking must therefore carefully manage image caches: cold-start scenarios where nodes have no cached images will exhibit longer startup latencies due to network image pulls, whereas warm-start scenarios with pre-pulled images will show reduced startup times.

Kubernetes best-practices documentation for large clusters emphasizes avoiding overly dense nodes with many pods that can lead to resource contention and slow pod startups, and suggests limits such as no more than roughly 110 pods per node for stability.[20] It also recommends gating cluster scaling actions in batches to avoid cloud-provider rate limits for instance creation, which can affect node startup latency.[20] OpenShift performance guidance similarly notes that etcd performance, control-plane capacity, and storage latency can all influence how quickly the cluster processes new Pods and nodes.[29] In a fair comparison between Kubernetes and Swarm, one should define precise time-to-ready metrics, such as the time from issuing a scale-up request to the first new instance passing a health check, and measure these metrics under both cold (uncached images, new nodes) and warm (cached images, existing nodes) conditions.

For Docker Swarm, task startup latency depends on similar factors: decision latency for assigning tasks to nodes, Docker image availability and pull time, container runtime startup, and application initialization. Swarm’s image caching behavior is largely inherited from Docker Engine; images present on a node will be reused, while absent images will be pulled from registries. A fair benchmark should predefine the image state on each node for cold and warm runs, ensuring that Kubernetes and Swarm face equivalent image-cache conditions. It is also advisable to instrument application readiness using health checks and to define a time-to-ready metric aligned with the point at which the orchestrator’s service discovery system will start sending traffic to new instances.

Ultimately, pod or container time-to-ready metrics should be captured as distributions (with p50/p95/p99) across many scale-out events for each orchestrator and scenario. GKE’s startup latency dashboards illustrate how to operationalize these metrics and integrate them into observability stacks.[16] Benchmark harnesses should replicate similar metrics collection for both Kubernetes and Swarm via logs, metrics, and time-stamped events.
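The phase decomposition described in this section can be sketched from time-stamped events. The event names below (created, scheduled, pulled, started, ready) are hypothetical labels chosen for the example; a real harness would have to map each orchestrator's events and logs onto them.

```python
from datetime import datetime

# (phase label, start event, end event); event names are illustrative.
PHASES = [
    ("scheduling", "created", "scheduled"),
    ("image_pull", "scheduled", "pulled"),
    ("runtime_start", "pulled", "started"),
    ("app_init", "started", "ready"),
]


def decompose(events):
    """Split time-to-ready into phases.

    events: dict mapping event name to an ISO-8601 timestamp string.
    Returns seconds per phase plus the overall time_to_ready.
    """
    ts = {name: datetime.fromisoformat(value) for name, value in events.items()}
    phases = {
        label: (ts[end] - ts[start]).total_seconds() for label, start, end in PHASES
    }
    phases["time_to_ready"] = (ts["ready"] - ts["created"]).total_seconds()
    return phases


if __name__ == "__main__":
    cold_start = {
        "created": "2024-01-01T12:00:00.000+00:00",
        "scheduled": "2024-01-01T12:00:00.250+00:00",
        "pulled": "2024-01-01T12:00:04.250+00:00",
        "started": "2024-01-01T12:00:04.750+00:00",
        "ready": "2024-01-01T12:00:06.000+00:00",
    }
    for phase, seconds in decompose(cold_start).items():
        print(f"{phase}: {seconds:.2f}s")
```

Keeping the image-pull phase separate makes cold and warm runs directly comparable: a warm run should differ mainly in that one phase.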

Orchestrator resource overhead: control-plane and per-node agents

Orchestrators consume CPU, memory, and I/O resources, both in the control plane and on worker nodes, thereby reducing the resources available for user workloads. In Kubernetes, control-plane components such as the API server, scheduler, controller manager, and etcd typically run on dedicated nodes, while per-node agents such as kubelet, kube-proxy, logging agents, and CNI plugins run on worker nodes.[20][29] OpenShift performance guidelines recommend keeping control-plane CPU and memory usage below around sixty percent of available capacity to maintain headroom for spikes and to avoid cascading failures.[29] They also advise against colocating other I/O-intensive workloads on control-plane nodes, and recommend fast, low-latency block storage for etcd to keep WAL fsync and backend commit latencies low.[29] These recommendations underscore the importance of measuring orchestrator overhead separately from application workloads.

On worker nodes, kubelet monitors Pod health, manages containers via the container runtime, and reports node status, while kube-proxy handles service traffic either via iptables or IPVS rules.[20][29] All of these components consume CPU cycles and memory, and in dense clusters their overhead can become significant. Kubernetes cluster-scale guidance suggests limiting pods per node and monitoring per-node resource usage to avoid overcommitment.[20] Monitoring orchestrator overhead per node thus involves tracking CPU and memory usage of Kubernetes system pods and processes, and potentially isolating them with cgroups or node labels so that their resource footprint is clearly separated from user workloads.

Docker Swarm has a lighter-weight control plane, with manager nodes running Raft and service scheduling logic, and worker nodes running the Docker daemon and overlay networking drivers.[21][39][45] Swarm’s per-node overhead tends to be lower than that of a full Kubernetes stack, simply because there are fewer components, but networking overhead from overlay networks and IPVS-based virtual IPs can be non-negligible.[39] Reports from practitioners indicate that Swarm overlay networks with virtual IP mode can impose a ten to thirty percent performance penalty compared to direct host networking, especially for latency-sensitive applications with high request rates, though this can be mitigated by using DNS round robin endpoint mode or alternative network drivers.[39] Measuring Swarm’s resource overhead therefore requires profiling Docker daemon CPU and memory usage, overlay network drivers, and any Swarm system containers, in addition to manager resource utilization.

In a fair comparison, control-plane nodes for Kubernetes and Swarm should be provisioned on similar hardware, and user workload capacity should be restricted to worker nodes. Resource overhead metrics to collect include average and peak CPU usage of control-plane components, memory footprint, and, where relevant, disk and network I/O utilization for etcd and Swarm managers. On worker nodes, resource usage of orchestrator agents should be measured as a fraction of total node capacity and as a function of pod/task density, to quantify how overhead scales with load. It is important to take measurements both at idle (no user workload, only orchestrator) and under workload stress, to distinguish fixed overhead from load-dependent overhead.

Horizontal scaling and scale-out behavior

Horizontal scaling refers to the ability of an orchestrator to adjust the number of application instances in response to load. Kubernetes provides a built-in Horizontal Pod Autoscaler (HPA) controller that periodically adjusts the desired scale of a target (Deployment, ReplicaSet, or certain custom resources) based on metrics such as CPU utilization, memory utilization, or custom metrics.[17] HPA reads resource requests defined in Pod specifications to compute utilization percentages, and can be configured to use multiple metrics, taking the maximum recommended scale among them as the desired replica count, bounded by configured minimum and maximum values.[17] This mechanism enables sophisticated autoscaling policies that react to resource usage and custom signals.
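The replica calculation can be illustrated with the ratio formula from the Kubernetes HPA documentation, desired = ceil(current replicas × current metric / target metric), taking the maximum across metrics and clamping to the configured bounds. This is a simplified sketch: it omits the controller's tolerance band, stabilization windows, and handling of unready Pods, and the function name is an assumption.

```python
import math


def desired_replicas(current_replicas, metrics, min_replicas, max_replicas):
    """Simplified HPA-style replica calculation.

    metrics: list of (current_value, target_value) pairs, one per metric.
    Each metric proposes ceil(current_replicas * current / target); the
    largest proposal wins and is clamped to [min_replicas, max_replicas].
    """
    if not metrics:
        raise ValueError("at least one metric is required")
    proposals = [
        math.ceil(current_replicas * current / target) for current, target in metrics
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    # 4 replicas; CPU at 90% vs a 60% target, memory at 50% vs a 70% target.
    print(desired_replicas(4, [(90, 60), (50, 70)], min_replicas=2, max_replicas=10))
```

For a Swarm comparison with an external autoscaler, applying the same formula and bounds to both stacks keeps the scaling policy constant so that only the orchestrator's reaction time differs.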

Kubernetes cluster-wide scalability guidelines recommend upper bounds on nodes and pods per cluster and stress that cluster scaling actions should be gated to avoid cloud provider quota issues and rate limits on instance creation.[20] For example, scaling up by thousands of nodes at once may exceed provider limits, so operators are advised to scale in batches with pauses.[20] OpenShift similarly recommends enabling machine health checks and limiting control-plane resource utilization to avoid cascading failures when scaling to large node counts.[29] These constraints matter for benchmarking: scale-out behavior must respect realistic operational practices, and orchestrator comparisons should consider not only single deployment scaling but also cluster-wide scaling of nodes and workloads.

Docker Swarm does not have an HPA equivalent built into the core, but scaling services horizontally via replica counts is straightforward, and external autoscaling scripts or tools can adjust replica counts based on metrics. Swarm’s design emphasizes quick scaling and a straightforward mental model for developers, but lacks some of the advanced autoscaling features and extensible metrics pipeline native to Kubernetes.[1][3][21][15] In practice, this means that in a fair comparison, one might benchmark both orchestrators under manual scale-out conditions (changing replica counts without autoscaling) to measure raw scaling speed, and, optionally, evaluate Kubernetes’ HPA-driven scaling and Swarm with an external autoscaler to understand differences in policy responsiveness and stability.

Scale-out behavior can be evaluated through metrics such as time-to-first-ready instance after a scale-up request, time to reach target capacity (all new replicas ready), transient overload or underprovisioning during scaling, control-plane CPU and memory consumption during scaling, and any error rates or failures encountered due to rate limits or resource exhaustion. Kube-burner and ClusterLoader2 are Kubernetes-specific tools that can drive large numbers of resource creations and updates to stress-test scaling behavior and measure performance and scalability metrics across different cluster sizes and workloads.[6][7] These tools can be used to explore how Kubernetes handles scaling scenarios that might be challenging to reproduce manually.

In a comparative orchestrator benchmark, experiments should include moderate scale-out scenarios (for example, doubling replicas from a medium baseline) and larger stress tests (such as scaling from zero to high replica counts), under reproducible conditions and with consistent autoscaling policies where applicable. Time-series metrics should be recorded to capture dynamic behavior during scaling, and statistical summaries should be computed across repeated runs.
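Two of the scale-out metrics named in this section, time to first ready instance and time to target capacity, can be derived from the scale request time and the per-replica ready times. The sketch below assumes timestamps in seconds on a shared clock; the function and field names are illustrative.

```python
def scale_out_metrics(request_time, ready_times, target_new):
    """Compute scale-out timings for one scale-up event.

    request_time: when the scale-up was requested (seconds).
    ready_times: when each new replica became ready (seconds, any order).
    target_new: number of new replicas the request asked for.
    time_to_target is None if fewer than target_new replicas became ready.
    """
    ordered = sorted(ready_times)
    first = ordered[0] - request_time if ordered else None
    reached = len(ordered) >= target_new
    target = ordered[target_new - 1] - request_time if reached else None
    return {"time_to_first_ready": first, "time_to_target": target}


if __name__ == "__main__":
    # Made-up event: scale-up requested at t=100 s, four new replicas requested.
    print(scale_out_metrics(100.0, [103.5, 102.0, 108.0, 104.0], target_new=4))
```

Collecting these two values per repetition, rather than a single run, is what allows the distributions and confidence intervals discussed earlier to be applied to scaling behavior as well.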

Node failure recovery and failover

Resiliency to node failures is a defining property of orchestrators. When a node fails or becomes unreachable, Kubernetes and Swarm must detect the failure, mark the node as not ready, reschedule workloads to healthy nodes (if sufficient capacity exists), and eventually evict or remove the failed node. In Kubernetes, node failure detection is managed by the node controller and kubelet via heartbeat mechanisms. Pods on unreachable nodes enter Terminating or Unknown states after a timeout, and are not deleted automatically until the node object is deleted, the kubelet resumes and completes deletion, or the Pod is force-deleted via the API.[18] By default, Kubernetes may wait up to several minutes before evicting Pods on an unreachable node, depending on configuration parameters such as pod-eviction-timeout.[18] In releases that use taint-based eviction, the corresponding setting is the tolerationSeconds applied to the not-ready and unreachable node taints. These defaults aim to avoid premature eviction in transient network partitions but can increase perceived failover time.

IBM’s guidance for Db2 on Kubernetes explains that it can take up to five minutes by default before Kubernetes evicts Pods from an unreachable node; operators can expedite eviction via force deletion in some cases, but best practice is to either delete the node object if it is permanently failed or allow the kubelet to recover and gracefully kill the Pods.[18] This highlights a key benchmarking dimension: node-failure recovery time is not a fixed property of Kubernetes, but depends on configuration, such as the node-monitor grace period and eviction timeout. Fair comparison with Docker Swarm must therefore ensure that node failure detection and eviction timeouts are aligned as closely as possible between orchestrators, or, at a minimum, transparently reported and interpreted.

Docker Swarm relies on its Raft consensus group to detect manager failures and on heartbeats to detect worker node failures. When a worker fails, Swarm will reschedule tasks to other workers; when a manager fails, the remaining managers elect a new leader, provided quorum persists.[45] Swarm’s behavior under node failures is influenced by Raft’s election timeouts and heartbeat intervals, which affect how quickly failures are detected and how long leader elections take. While Swarm’s defaults are generally tuned for fast detection, benchmarking should document and, if necessary, adjust timers to align with Kubernetes settings for comparable experiments.

Chaos-engineering tools, including those integrated with Kubernetes, can be used to inject failures and measure recovery times. For example, Harness Chaos Engineering provides faults that introduce network latency to ECS container instances and can stop and start EC2 instances to simulate failures, while monitoring service availability and recovery.[28] While this specific fault targets ECS, similar techniques can be used with Kubernetes and Swarm: terminating nodes, introducing network partitions, or degrading network conditions, then measuring how quickly orchestrators restore service. Key metrics include time from failure injection to the first observable service degradation, time to detect node failure, time to reschedule replicas, and time until service-level objectives (for example, p99 latency) return to steady-state levels.

In a benchmark harness, node-failure experiments should be run multiple times under identical conditions for both Kubernetes and Swarm, with controlled failure injection and precise time-stamping of events. The harness should record both control-plane events (node marked NotReady, Pod rescheduling, etc.) and data-plane metrics (error rates, throughput drops, latency spikes), enabling a comprehensive comparison of resiliency characteristics.
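The recovery metrics listed above reduce to differences between a handful of timestamped events. The sketch below assumes the harness has already extracted those events; the `FailureTimeline` fields and the choice to measure rescheduling from the moment of detection (rather than from injection) are illustrative assumptions, not something the cited sources prescribe.

```python
from dataclasses import dataclass


@dataclass
class FailureTimeline:
    injected: float        # failure injected (e.g. node powered off)
    degraded: float        # first data-plane error or latency breach observed
    node_not_ready: float  # control plane marks the node as failed
    rescheduled: float     # last replacement replica reports ready
    slo_restored: float    # p99 latency back inside the steady-state band


def recovery_metrics(t: FailureTimeline) -> dict:
    """Split one failure experiment into the phases named in the text."""
    return {
        "time_to_degradation": t.degraded - t.injected,
        "time_to_detect": t.node_not_ready - t.injected,
        # Measured from detection, so slow detection and slow
        # rescheduling can be attributed separately.
        "time_to_reschedule": t.rescheduled - t.node_not_ready,
        "time_to_recover": t.slo_restored - t.injected,
    }
```

Keeping detection and rescheduling separate matters for the fairness argument: a stack with long default timers but fast rescheduling looks very different from one with the opposite profile, even when total recovery time is similar.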

Networking and service-discovery overhead

Cluster networking and service discovery are major determinants of end-to-end performance in orchestrated environments. Kubernetes typically uses a Container Network Interface (CNI) plugin to provide pod-to-pod connectivity and has a built-in service abstraction implemented via kube-proxy, which configures iptables or IPVS rules to direct traffic to Pods.[20][29] Service discovery is commonly based on DNS, with CoreDNS acting as a cluster DNS server, resolving service names to virtual IPs or directly to Pod IPs.[44] Each layer can introduce latency and overhead: network overlays add encapsulation and routing overhead; kube-proxy rules can introduce extra hops or packet processing; and DNS resolution adds lookup latency, especially under high query loads.[4][44]

CoreDNS metrics provide insight into service discovery overhead by tracking queries per second, response latency, cache hit rates, and error responses.[44] Monitoring these metrics in Kubernetes clusters under benchmark workloads can reveal whether service discovery becomes a bottleneck, for example due to excessive DNS queries from application pods or misconfigurations. Kubernetes latency analysis further notes that network latency within and between nodes can contribute to request delays, and that network-intensive applications may benefit from specialized hardware or tuning.[4][30] Research on network acceleration in Kubernetes clusters with NVIDIA BlueField DPUs demonstrates that offloading network processing to specialized hardware can reduce latency by over twenty percent and significantly improve throughput and horizontal scalability, illustrating the sensitivity of Kubernetes networking performance to underlying infrastructure.[30]

Docker Swarm uses an overlay network for inter-container communication and supports two endpoint modes for services: virtual IP (VIP) and DNS round robin (DNSRR). In VIP mode, Swarm uses IPVS to load-balance traffic to service tasks, while in DNSRR mode, DNS queries resolve to multiple IPs and clients handle load balancing.[39] Practitioners report that using Swarm’s overlay network with VIPs can incur a ten to thirty percent performance penalty relative to direct host networking, particularly for latency-sensitive systems with high request rates.[39] They also note that for single-task-per-service scenarios such as databases, using DNSRR endpoint mode avoids the extra layer of VIP/IPVS and can reduce latency.[39] Additionally, encrypted overlays (for example, using IPsec) can add another several percent of overhead, and are discouraged for latency-critical systems.[39]

Fair benchmarking must therefore control networking configuration carefully. The choice of CNI plugin for Kubernetes, the use of IPVS or iptables in kube-proxy, the Swarm overlay configuration, and whether encrypted links are enabled should be explicitly documented and harmonized as much as possible. For example, one may configure Kubernetes to use a popular CNI and kube-proxy in IPVS mode, and Swarm to use overlay networks with VIPs, mirroring the most common production setups, and then perform follow-up experiments using more optimized configurations like host networking to understand best-case scenarios.[4][20][39] Networking metrics to collect include intra-cluster TCP/HTTP round-trip latency, packet loss, bandwidth, and service discovery lookup latency, as well as CPU overhead of network processing on nodes. Tools such as Fortio and custom ping or HTTP microbenchmarks can be used to measure network latency and throughput between pods or containers, with and without orchestration overhead.[10][30]

Pod density per node and cluster scale

Pod density per node and cluster size influence orchestrator performance and scalability. Kubernetes guidance for large clusters gives ballpark upper bounds for stable operation under typical configurations: no more than 110 pods per node, 5,000 nodes per cluster, 150,000 total pods, and 300,000 total containers.[20] These numbers are not hard limits but represent tested scales under specific assumptions. At high pod densities and large cluster sizes, kubelet, kube-proxy, and the control plane can experience increased load in the form of frequent state updates, health checks, and network rule management.[20][29] To mitigate issues, Kubernetes documentation recommends careful sizing of control-plane nodes, potentially dedicating separate etcd instances for event storage, and adjusting addon resource limits, among other practices.[20][29]

OpenShift’s performance guidelines reinforce the need to size etcd and control-plane components appropriately and to monitor key metrics such as p99 etcd disk WAL fsync duration and network peer latency, as well as the number of etcd leader changes, to detect performance degradation as clusters grow.[29] They also advise running core cluster components like CoreDNS and metrics-server with elevated priorities to ensure they are not preempted and can continue to serve essential functions under load.[29] These recommendations suggest that dense, large clusters may amplify differences between orchestrators and should be part of the benchmarking space if the target deployment environment involves high density or scale.

Docker Swarm does not publish comparable official scale guidelines, but practical experiences and limited studies suggest that Swarm can scale to hundreds or thousands of containers, with performance influenced by overlay network complexity, Raft manager load, and node resource capacity.[21][39][45] Swarm’s simpler architecture may provide an advantage at moderate scales, but it may also encounter scaling challenges in highly complex or multi-tenant environments. Comparative studies of container orchestration tools at the edge and in cloud contexts report that Kubernetes and its distributions can provide effective scheduling and management across heterogeneous resources, though they do not always include Swarm in large-scale scenarios.[37]

For a fair orchestrator comparison, one should evaluate performance across several pod- or task-density regimes: low density, where nodes run a small number of relatively large containers; moderate density; and high density approaching recommended limits for Kubernetes. Metrics should include application-level throughput and latency, orchestrator CPU and memory overhead per node, control-plane resource utilization, and any error or failure rates. This helps reveal whether either orchestrator degrades more sharply at high densities or scales more gracefully.

Image-pull behavior and caching

Container image pull behavior substantially affects startup latency and scaling responsiveness, especially in scenarios involving large images or frequent deployments. Systems like AWS ECS allow explicit configuration of image-pull behavior via parameters such as ECS_IMAGE_PULL_BEHAVIOR, which can be set to “default” (always attempt remote pull), “once” (pull only if not previously pulled or the cached image was cleaned), or “prefer-cached” (use cached image if present, without automated cleanup).[19] Setting this parameter to values that favor caching can reduce deployment times by avoiding redundant image downloads, at the cost of potentially using stale images if not managed carefully.[19]

Although Kubernetes and Docker Swarm do not expose exactly the same configuration knobs as ECS, their underlying Docker or container runtimes behave similarly: if an image with the specified tag already exists locally, it is reused, unless the pull policy or configuration forces a re-pull. To benchmark orchestrator behavior in a fair and controlled manner, one must therefore carefully define and manage image-cache state on each node. Cold-start experiments should ensure that images are not present on nodes before deployment, forcing full image pulls, while warm-start experiments should pre-pull images using identical mechanisms for both orchestrators. Image sizes and registry locations should be identical, ensuring that observed differences in pull times reflect orchestration and networking overhead rather than differences in content or registry performance.

USENIX work on cache warm-up demonstrates that cache warmup time depends on cache size and desired tolerance level, following a power-law relationship coupled with an exponential discount in tolerance, and stresses that warmup reflects workload characteristics such as access distributions.[25] Though this work targets LRU-style caches, its conceptual lesson applies broadly: benchmarking results are sensitive to whether systems are measured during warm-up or after caches have achieved steady-state hit rates. SpeedCurve’s guidance on performance testing in CI likewise highlights the need to warm caches before testing, noting that production environments typically enjoy high cache hit ratios and that tests should replicate this condition for realistic results.[25][34] For orchestrator benchmarks, this means that image caches on nodes, HTTP caches in front of services, and database caches should be warmed to representative levels before measuring steady-state performance, unless the goal is specifically to study cold-start behavior.

Incorporating image-pull behavior into the benchmarking methodology thus requires explicit test phases: cold-start tests that prefer fresh nodes and empty image caches to evaluate worst-case behavior, and warm-start tests with pre-populated caches to measure steady-state scaling responsiveness. Experiments must be conducted symmetrically for Kubernetes and Swarm, documenting cache state and registry conditions to preserve fairness.

Designing an Unbiased Experiment for Orchestrator Comparison

Clarifying goals and hypotheses

Designing a fair benchmarking experiment begins with clearly articulating the goals and hypotheses. In the context of comparing Kubernetes and Docker Swarm on identical hardware, typical goals might include quantifying differences in control-plane latency, pod/container startup time, steady-state application throughput and p99 latency, orchestrator resource overhead at various scales, horizontal scaling behavior, and failover times. Hypotheses should be concrete and falsifiable, such as: “On identical hardware and with identical workloads, Kubernetes exhibits lower p99 latency at high throughput for a microservices workload than Docker Swarm,” or “Docker Swarm achieves faster time-to-first-ready instance during small scale-out events due to a lighter-weight control plane.”

Articulating hypotheses helps structure the experiment design and ensures that collected metrics are aligned with what is being tested. It also encourages thinking explicitly about potential confounding variables and error sources. For example, if one hypothesizes that Kubernetes has superior scheduling performance at large scale due to a more sophisticated scheduler, one must ensure that Swarm is not handicapped by a misconfigured network or lower resource allocations on manager nodes. Literature on benchmarking container orchestration systems emphasizes that testing objectives span functionality, resiliency, performance, security, and observability, and that performance benchmarks should be tailored to the specific aspects being evaluated.[5] By mapping hypotheses to these categories, one can systematically design experiments that are neither too narrow nor too diffuse.

Selecting a neutral hardware and software substrate

To attribute performance differences to the orchestrator rather than the underlying infrastructure, Kubernetes and Docker Swarm must be deployed on identical hardware and as similar a software substrate as possible. This typically involves using the same physical machines or homogeneous virtual machines with identical CPU, memory, storage, and network characteristics, the same operating system distribution and version, and the same kernel and container runtime versions. Cloud environments introduce variability and quotas, so if benchmarking in the cloud, one should choose instance types with deterministic performance characteristics and ensure that clusters for both orchestrators are provisioned in the same availability zones, with equivalent storage and networking configurations.[20][29]

OpenShift performance guidelines stress that etcd should run on fast, low-latency block devices with specific IOPS characteristics, and that network-attached storage can introduce unpredictable latency.[29] They also recommend avoiding NFS or other network file systems for etcd, and instead using local SSDs or NVMe devices, potentially via PCI passthrough, to ensure predictable performance.[29] These recommendations apply equally to any Kubernetes distribution being benchmarked and, by extension, to comparable storage configurations in Swarm-based clusters. If Kubernetes uses local SSDs for etcd and Swarm managers use slower network-attached disks for Docker metadata, differences in control-plane responsiveness may reflect storage rather than orchestration differences.

Benchmarking best practice is increasingly to encapsulate benchmark software in “hermetically sealed containers” that include all dependencies and configuration, ensuring consistent runtime environments across systems.[33] This approach minimizes variability stemming from different libraries, compilers, or runtime versions. Containerized benchmark suites for data-center hardware, such as those showcased in open computing projects, emphasize that containers should package as much of the software stack as possible, leaving only a minimal substrate difference such as Linux flavor or kernel version.[33] The same principle applies here: application workloads and load generators should be built as identical container images used on both Kubernetes and Swarm, thereby eliminating differences in application-level stacks.

Baseline benchmarking of the hardware using tools like sysbench for CPU, memory, and file I/O can help detect anomalies and verify that machines are indeed comparable.[12][29] Sysbench can test CPU performance through prime number calculations, memory performance through sequential write and read tests, and file I/O performance through random and sequential workloads.[12] OpenShift documentation additionally recommends tools like fio for measuring storage performance relevant to etcd.[29] Running such microbenchmarks before orchestrator deployment allows one to document hardware performance and exclude nodes that exhibit outlier behavior due to underlying issues.

Workload design: identical applications, images, and resource limits

A cornerstone of unbiased orchestrator comparison is the use of identical workloads, implemented via identical container images and resource configurations. Studies comparing container orchestration tools often employ a mix of synthetic benchmarks and real-world applications, such as microservices-based web services, databases, or streaming systems.[2][23][24][37] A comparative study of container orchestration versus serverless platforms, for instance, evaluated performance under various workloads to understand differences in response time and scaling behavior.[23] Similarly, frameworks like COFFEE define benchmark workloads for orchestrators and automate deployment to ensure consistency.[24] Drawing from these efforts, one should define a suite of workloads that cover relevant scenarios, such as simple stateless HTTP services, stateful services with databases, and CPU- or memory-intensive batch jobs.

For each workload, a single container image should be built and versioned, then deployed unmodified on both Kubernetes and Swarm. Resource requests and limits (CPU and memory) must be consistently configured. In Kubernetes, resource requests inform HPA and scheduling decisions, and misalignment can cause under- or over-provisioning.[17][20] Therefore, the same resource reservations should be applied conceptually in Swarm, using reservation configurations in compose files or equivalent. Replica counts should be identical in baseline runs, and any autoscaling policies must be configured to be as equivalent as possible; for example, both orchestrators might scale based on CPU utilization thresholds.

Care must also be taken to ensure that application-level configurations, such as thread pools, connection limits, and cache sizes, are identical across orchestrators. If Kubernetes deployments specify environment variables or config maps that tune application behavior, Swarm stacks must replicate the same configuration. Failure to do so can result in behavior differences that are incorrectly attributed to orchestration. Academic comparisons of container orchestration engines emphasize that functional capabilities and scheduling policies differ among systems, but for performance assessment the workload semantics must be held constant.[2][31]

Finally, workload arrival patterns and traffic profiles must be carefully designed, ideally based on representative production traces or standard benchmark suites. Simple constant-rate or step-load patterns can be used for controlled experiments, but real-world workloads often exhibit diurnal or bursty traffic. The benchmark harness should provide configurable patterns to explore orchestrator behavior under different load dynamics.

Managing caches, warm-up, and steady-state measurement

Cache effects, both at the infrastructure and application layers, are a pervasive source of bias in performance measurements. The USENIX study on cache warm-up demonstrates that the time required for a cache to reach performance close to a continuously running cache depends on cache size and the tolerated difference in hit rate, with warmup time scaling according to a power-law relationship modulated by an exponential term in the tolerance parameter.[25] This implies that warmup behavior is workload-dependent and not easily captured by a single fixed duration; instead, one must monitor performance until metrics stabilize within acceptable bounds. SpeedCurve’s guidance on performance testing in CI further underscores that caches should be warmed prior to testing because production systems typically operate with warm caches and high cache hit ratios.[34]

In orchestrator benchmarking, there are several caches to consider. Container image caches on nodes determine whether images need to be pulled over the network. Application-level caches, such as HTTP response caches, database caches, or in-memory key–value stores, influence response times and throughput. DNS caches at clients and CoreDNS can affect service discovery latencies. Benchmark harnesses must define clear cache conditions for each experiment. For cold-start tests, images should be removed from nodes, and application caches flushed or restarted. For warm-state tests, an explicit warm-up phase should run for sufficient time and volume to achieve stable hit ratios and throughput, as evidenced by monitoring metrics.

Warm-up phases should be clearly separated from measurement phases. During warm-up, load is applied to the system to fill caches and allow just-in-time compilation or adaptive optimizers within runtimes to stabilize. Metrics during this phase are useful for understanding cold-start behavior but should not be conflated with steady-state performance. Once metrics such as throughput and p99 latency stabilize within a narrow band over time, the benchmark harness can begin a measurement phase of fixed duration or number of requests, during which results are recorded for analysis. This two-phase approach aligns with best practices in performance testing and helps avoid underestimating steady-state performance due to transient warmup effects.[25][34][42]
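The "stabilize within a narrow band" criterion can be made concrete with a sliding-window check over a periodically sampled metric. This is a sketch under stated assumptions: the function name `steady_state_index`, the five-sample window, and the 5% tolerance are illustrative choices, and a real harness would tune them per workload rather than treat them as recommended values.

```python
from statistics import mean
from typing import Optional, Sequence


def steady_state_index(samples: Sequence[float],
                       window: int = 5,
                       tolerance: float = 0.05) -> Optional[int]:
    """Return the index where the first stable window begins, or None.

    A window counts as stable when every sample in it lies within
    +/- tolerance (as a fraction) of that window's own mean.
    """
    for start in range(len(samples) - window + 1):
        chunk = samples[start:start + window]
        centre = mean(chunk)
        if centre > 0 and all(abs(v - centre) <= tolerance * centre for v in chunk):
            return start
    return None


# Throughput (requests/s) sampled once per interval during warm-up.
rps = [100, 300, 600, 900, 1000, 1010, 990, 1005, 995, 1000]
print(steady_state_index(rps))  # index at which measurement could begin
```

A fixed warm-up duration is simpler to operate, but a check of this kind is a cheap way to confirm after the fact that the chosen duration was long enough for a given stack.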

Image-pull caching deserves particular attention. Nodes should be prepared according to the desired scenario: for cold-start, ensure no relevant images are present; for warm-start with cached images, pre-pull images on all nodes and avoid concurrent image pulls during the measurement phase.[19][25] Logs and metrics from container runtimes can be used to verify whether images were pulled or reused. Similar care should be applied to database caches, ensuring either that caches are cold (for worst-case analyses) or pre-filled with representative working sets before measurement.

Load profiles, duration, and avoiding load-generator bottlenecks

Load profiles define how traffic arrives at the system under test. Benchmark harnesses must support configurable load patterns, such as constant-rate traffic, ramp-up and ramp-down curves, or bursty patterns, and must ensure that the load generator itself does not become the limiting factor. Tools like wrk, Fortio, Vegeta, and k6 are widely used for HTTP load generation and can be configured for various loads and durations.[8][9][10][11] For example, wrk allows specifying the number of threads, connections, and test duration, and generates significant load on a single multi-core CPU.[8] Vegeta and Fortio focus on generating constant QPS with a configurable rate, while Fortio additionally provides a web UI and REST API for controlling tests and viewing results, including p99 latency.[10][11]

The k6 documentation on running large tests emphasizes that maximizing generated load is a multi-faceted process, involving OS tuning to increase network and user limits, monitoring CPU and memory usage on the load generator, and designing efficient test scripts.[9] It recommends ensuring that CPU utilization for the load generator remains below about eighty percent to avoid throttling and that memory utilization stays below ninety percent to avoid swapping.[9] It also suggests monitoring network bandwidth to detect NIC saturation; if traffic is capped at, say, one gigabit per second, upgrading to larger instances may be necessary.[9] These recommendations apply equally to wrk, Fortio, and Vegeta: the machine(s) generating load must be provisioned with sufficient CPU, memory, and network capacity, and their resource utilization must be monitored during tests.
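The utilization thresholds from the k6 guidance translate directly into a post-run check on the generator host. The sketch below assumes CPU and memory percentages were sampled during the run by some external means; the function name and the result keys are hypothetical, chosen only to resemble the `cpu_pct_max` and `saturated` fields mentioned in the status section.

```python
from typing import Sequence


def generator_saturation(cpu_pct: Sequence[float],
                         mem_pct: Sequence[float],
                         cpu_limit: float = 80.0,
                         mem_limit: float = 90.0) -> dict:
    """Flag a run whose load generator may have been the bottleneck.

    Limits default to the roughly 80% CPU and 90% memory ceilings
    cited from the k6 documentation.
    """
    cpu_max = max(cpu_pct)
    mem_max = max(mem_pct)
    return {
        "cpu_pct_max": cpu_max,
        "mem_pct_max": mem_max,
        "saturated": cpu_max > cpu_limit or mem_max > mem_limit,
    }
```

A run flagged as saturated is best discarded or repeated with more generator capacity, since its throughput figure describes the generator rather than the stack under test.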

Test duration must be long enough to accumulate a statistically meaningful number of samples, especially for tail percentiles like p99. Short tests with few requests produce unstable estimates of high percentiles, effectively equating p99 with maximum latency in small samples.[26] High-volume systems and industry practices often rely on thousands or tens of thousands of requests to compute stable percentile estimates. Depending on workload intensity, this may require multi-minute tests for each configuration. The benchmark harness should parameterize test duration and request volume, and include checks that the number of samples is adequate for robust percentile estimation.
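The collapse of p99 into the maximum is easy to demonstrate with the nearest-rank percentile definition. The sketch below uses that definition (other interpolation schemes exist) and a hypothetical adequacy rule of requiring at least ten observations beyond the percentile; the threshold of ten is an assumption for illustration, not a figure from the cited sources.

```python
from typing import Sequence


def nearest_rank_percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling, no float error
    return ordered[rank - 1]


def min_samples(pct: int, tail_observations: int = 10) -> int:
    """Smallest n that leaves `tail_observations` samples above the percentile."""
    return -(-tail_observations * 100 // (100 - pct))  # ceiling division


small = list(range(1, 51))  # 50 latency samples
# With fewer than 100 samples, nearest-rank p99 is simply the maximum.
print(nearest_rank_percentile(small, 99) == max(small))
print(min_samples(99))  # samples needed under the ten-observation rule
```

Under this rule a p99 estimate needs on the order of a thousand requests per sample, which is consistent with the "thousands or tens of thousands" guidance above.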

In orchestrator comparisons, it is also useful to vary load intensity, measuring performance under low, moderate, and high load, and identifying saturation points. As load increases, the system under test will eventually reach maximum throughput, beyond which latency rises sharply and error rates increase. Comparing where this knee occurs for Kubernetes and Swarm can reveal differences in scalability and resilience under pressure. Load profiles should be identical between orchestrators for each experiment, and the harness should verify that load generators are achieving the intended rates.

Replications, randomness, and statistical treatment

Performance measurements are inherently noisy due to factors such as scheduling variability, background processes, network jitter, and non-deterministic timing behaviors. To draw reliable conclusions about differences between Kubernetes and Swarm, experiments must be repeated multiple times under identical conditions, and results must be analyzed using appropriate statistical methods. Academic studies on container orchestration and network acceleration in Kubernetes clusters illustrate this rigor. For example, a thesis benchmarking network acceleration in Kubernetes clusters with NVIDIA BlueField DPUs used independent-samples t-tests to assess statistical significance and reported p-values below 10⁻⁶ when observing a twenty-three percent reduction in latency compared to baseline processing.[30] The thesis reported its horizontal scaling efficiency and throughput improvements with the same statistical treatment.[30]

Similarly, comparative analyses of scheduling algorithms and other systems components use statistical tests and confidence intervals to determine whether observed differences are meaningful.[38] In orchestrator benchmarking, one may collect metrics such as mean p99 latency or mean throughput over multiple runs for each configuration. For each metric, the harness can compute sample means, variances, and confidence intervals, and apply tests such as t-tests or non-parametric equivalents where appropriate to assess whether differences between Kubernetes and Swarm are statistically significant. Visualizing distributions via boxplots or histograms can provide further insight into variability and outliers.
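For the small sample counts typical of repeated benchmark runs, a Student-t confidence interval on the mean is the simplest summary. The sketch below is standard-library only and hard-codes two-sided 95% critical values for up to 20 degrees of freedom; beyond that it falls back to the normal value 1.96, which is slightly optimistic for moderate sample sizes. The function names are hypothetical, and treating overlapping intervals as a tie is a deliberately conservative rule, not a formal hypothesis test.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple

# Two-sided 95% Student-t critical values for df = 1..20 (standard table).
T95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
       2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086]


def ci95(samples: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, lower bound, upper bound) of a 95% CI on the mean."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples for a confidence interval")
    df = n - 1
    t = T95[df - 1] if df <= len(T95) else 1.96  # normal approximation beyond the table
    half_width = t * stdev(samples) / sqrt(n)
    m = mean(samples)
    return m, m - half_width, m + half_width


def intervals_overlap(a: Tuple[float, float, float],
                      b: Tuple[float, float, float]) -> bool:
    """True when two (mean, lo, hi) intervals share any point."""
    return a[1] <= b[2] and b[1] <= a[2]
```

When a formal test is needed, a Welch t-test or a rank-based alternative from a statistics library is the more rigorous route; the overlap rule only guards against declaring a winner from noise.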

Percentiles themselves are more complex to handle statistically than means, but approximate methods exist for confidence intervals on quantiles, and bootstrapping can be applied if necessary. At minimum, reporting both median and p95 or p99 latencies across runs, along with measures of variation, provides a richer picture than single-run point estimates. The p99 latency article emphasizes that high percentiles must be interpreted alongside lower percentiles and awareness of time windows, since p99 over a one-minute window may differ significantly from p99 over an hour.[26] Benchmark harnesses should thus document time windows for metrics aggregation and ensure consistency across runs.
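A percentile bootstrap is one way to attach an interval to a tail percentile without distributional assumptions. The sketch below resamples the latency observations with replacement and reads the interval off the sorted resampled estimates; the function names, the 1,000-resample default, and the fixed seed are illustrative assumptions. It also assumes the observations are independent, which latency samples from a single run often are not.

```python
import random
from typing import Sequence, Tuple


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (rank = ceil(pct/100 * n))."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def bootstrap_ci(values: Sequence[float], pct: int = 99,
                 resamples: int = 1000, level: float = 0.95,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for a latency percentile."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(values)
    estimates = sorted(
        percentile(rng.choices(values, k=n), pct) for _ in range(resamples)
    )
    lo_idx = int((1 - level) / 2 * resamples)
    hi_idx = int((1 + level) / 2 * resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]
```

Because the interval is computed per aggregation window, the window length should be recorded alongside it, in line with the point about time windows above.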

Randomness is present in many aspects of distributed systems, such as connection scheduling, request routing, and background workload. To reduce sensitivity to transient fluctuations, one should randomize the order of experiment conditions when possible, avoid always running Kubernetes before Swarm or vice versa, and ensure that intervening time is sufficient for the system to return to baseline conditions. When experiments involve randomization within workloads (for example, random selection of request payloads), seeds should be logged to enable reproducibility.
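Randomizing the order of conditions while keeping runs reproducible only requires a seeded generator and a logged seed. The following sketch shuffles the stack order independently for each repetition; the function name `run_order` and the stack labels are illustrative, not the harness's actual interface.

```python
import random
from typing import List, Sequence


def run_order(stacks: Sequence[str], repetitions: int, seed: int) -> List[List[str]]:
    """Return one independently shuffled stack order per repetition.

    The same seed always yields the same plan, so logging the seed is
    enough to reproduce the schedule later.
    """
    rng = random.Random(seed)
    plan = []
    for _ in range(repetitions):
        order = list(stacks)
        rng.shuffle(order)
        plan.append(order)
    return plan


seed = 20240501  # hypothetical value; record it with the results
for rep, order in enumerate(run_order(["jaco", "kubernetes", "k3s", "swarm"], 3, seed)):
    print(rep, order)
```

Shuffling per repetition, rather than once per campaign, keeps slow drifts in the test bed (thermal state, neighbor traffic, registry load) from lining up with any one stack.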

Isolation and interference control

One of the most pervasive threats to fair benchmarking is interference from other workloads or system activities unrelated to the experiment. For orchestrator comparisons, this means ensuring that no extraneous workloads run on cluster nodes beyond the orchestrator’s own system services and the workloads being benchmarked. Control-plane nodes should not run user workloads, especially I/O-intensive ones, as emphasized by OpenShift best practices.[29] Worker nodes should be dedicated to the benchmark workloads during experiments, and background processes such as package updates or backup tasks should be disabled or scheduled outside the benchmark window.

The environment in which the load generator runs also requires isolation and monitoring. The k6 documentation recommends monitoring CPU, memory, and network usage on the load-generator machine using tools like htop, nmon, or iftop, and suggests keeping load-generator CPU utilization below eighty percent to avoid self-throttling.[9] Similar guidelines apply to wrk, Vegeta, and Fortio: load generators should not contend with other applications for resources, and their own resource limits should be set generously enough that they can saturate the system under test without saturating themselves.[8][10][11]

System-level performance baselines can help detect interference. Sysbench tests for CPU, memory, and file I/O can be run periodically to ensure that nodes exhibit consistent performance, and deviations may indicate contention or hardware issues.[12] For storage, OpenShift recommends using fio to benchmark block device performance for etcd and to continually monitor cluster performance as scale increases.[29] If baseline tests reveal anomalies, nodes should be investigated or excluded from experiments.

Network interference can arise from other traffic within the same data center or cloud region. While this cannot always be eliminated, one can mitigate its impact by running experiments at times of relatively low contention, using dedicated VLANs or subnets where possible, and monitoring network latency and bandwidth during tests. Chaos or fault-injection tools should be disabled unless they are explicitly part of the experiment, as they can introduce confounding failures.

Automation and reproducibility practices

Given the complexity of orchestrator benchmarking, automation is essential for reproducibility and fairness. An automated harness should orchestrate the entire lifecycle of experiments, including cluster provisioning, workload deployment, cache preparation, warm-up, load generation, metrics collection, teardown, and result analysis. Infrastructure-as-code tools and configuration management frameworks can encode cluster configurations for Kubernetes and Swarm, ensuring that deployments are repeatable and version-controlled.

The concept of hermetically sealed benchmark containers provides a model for encapsulating benchmark logic and dependencies in standardized images.[33] For example, each workload could be packaged in a container image that includes the application, dependencies, and instrumented logging, and each load generator could be packaged similarly with its scripts and configuration. These images would be used unchanged across orchestrators, ensuring that only the orchestration layer varies. Versioning of images and configurations should be managed via a source-control system, and the harness should record image digests and configuration hashes for each experiment run.
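
One possible way to record image digests and configuration hashes per run is sketched below. The record layout and the names `config_hash` and `run_record` are assumptions made for illustration; the digests themselves would come from the container registry or runtime.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """SHA-256 of a canonical JSON encoding, so key order does not change the hash."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_record(run_id: str, image_digests: dict, config: dict) -> dict:
    """Bundle what is needed to later prove which images and settings a run used."""
    return {
        "run_id": run_id,
        "image_digests": dict(sorted(image_digests.items())),
        "config_sha256": config_hash(config),
    }
```

Two runs with the same `config_sha256` and the same digests differ only in the orchestration layer and in run-to-run noise, which is the property the comparison relies on.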

Frameworks like COFFEE demonstrate how automated benchmarking infrastructures for container orchestrators can be designed to run experiments across self-hosted and cloud environments, manage multiple orchestrators, and produce comparable metrics.[24] Similarly, the “awesome harness engineering” resources collected for AI agents illustrate patterns and tools for evaluation harnesses, including observability, orchestration, and metric analysis.[40] While these are not orchestrator-specific, they underline the importance of designing harnesses as first-class systems with modular architectures.

Automated integration with CI/CD pipelines can enable continuous performance regression testing. As SpeedCurve notes, performance testing can be integrated into CI such that the build is broken if performance budgets are violated, forcing teams to treat performance as a non-functional requirement in acceptance criteria.[34] For orchestrator configurations, one could imagine a pipeline that, upon configuration changes, runs a subset of benchmark workloads on a small test cluster and fails the build if p99 latency or throughput degrade beyond thresholds compared to baselines. Such integration requires the benchmark harness to be fast enough and scalable enough to run regularly, but even periodic use can help catch performance regressions.
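
A regression gate of this kind can be sketched in a few lines. The ten-percent and five-percent tolerances and the metric keys `p99_ms` and `rps` are illustrative assumptions, not values taken from any cited source.

```python
def regression_failures(baseline: dict, current: dict,
                        max_p99_increase: float = 0.10,
                        max_rps_drop: float = 0.05) -> list:
    """Return human-readable reasons to fail the build; an empty list means pass."""
    failures = []
    p99_limit = baseline["p99_ms"] * (1 + max_p99_increase)
    if current["p99_ms"] > p99_limit:
        failures.append(f"p99 {current['p99_ms']} ms exceeds limit {p99_limit:.1f} ms")
    rps_floor = baseline["rps"] * (1 - max_rps_drop)
    if current["rps"] < rps_floor:
        failures.append(f"throughput {current['rps']} rps below floor {rps_floor:.1f} rps")
    return failures
```

In a pipeline, a non-empty result would translate into a non-zero exit code. Because single runs are noisy, a gate like this is more trustworthy when both baseline and current values are means over several repetitions.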

Tools for Workload Generation and Orchestrator Benchmarking

wrk: high-throughput HTTP benchmarking

Wrk is a modern HTTP benchmarking tool capable of generating significant load from a single multi-core CPU by combining a multithreaded design with scalable event notification mechanisms such as epoll and kqueue.[8] It supports configurable numbers of threads and connections, test durations, and optional LuaJIT scripting for customizing HTTP request generation, response processing, and reporting.[8] For example, a typical command might run a thirty-second benchmark with twelve threads and four hundred concurrent connections against a specified URL.[8] Wrk reports metrics including requests per second, latency distribution, and error counts, making it well-suited for stress-testing HTTP endpoints.

In the context of orchestrator benchmarking, wrk can be used to generate load against services deployed on Kubernetes and Swarm, measuring throughput and latency distributions. Because wrk runs as a single process, it is particularly appropriate when the system under test can be saturated by a single load-generator host. Attention must be paid to the load-generator machine’s network and CPU limits: wrk’s documentation recommends ensuring sufficient ephemeral ports and a large listen backlog on the server to handle connection bursts, and warns that per-request Lua scripting can reduce achievable load.[8] These considerations align with general advice on avoiding load-generator bottlenecks.

Wrk’s strengths include its simplicity, high performance, and low overhead. Its limitations include a relatively simple statistical output compared to tools like Fortio or k6, and limited built-in support for complex test scenarios. However, its LuaJIT scripting capability allows flexible request patterns if needed. In an automated harness, wrk can be wrapped in scripts that set test parameters, run benchmarks, and parse results, storing throughput and latency metrics for analysis.
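
The wrapping described above might look like the following sketch. The `-t`, `-c`, `-d`, and `--latency` flags are wrk's documented options; the helper names and the regular expressions are assumptions, written against wrk's usual textual report and handling only the `us`, `ms`, and `s` latency units.

```python
import re

# Conversion factors from wrk's latency units to milliseconds.
_UNIT_TO_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0}


def wrk_command(url: str, threads: int = 12, connections: int = 400,
                duration_s: int = 30) -> list:
    """Build the argument list for a wrk run with the latency distribution enabled."""
    return ["wrk", f"-t{threads}", f"-c{connections}", f"-d{duration_s}s",
            "--latency", url]


def parse_wrk_output(text: str) -> dict:
    """Extract requests/sec and the percentile lines from wrk's report."""
    result = {}
    match = re.search(r"Requests/sec:\s+([\d.]+)", text)
    if match:
        result["rps"] = float(match.group(1))
    # Percentile lines look like "     99%   12.50ms" when --latency is set.
    for pct, value, unit in re.findall(
            r"^\s*(\d+)%\s+([\d.]+)(us|ms|s)\s*$", text, flags=re.MULTILINE):
        result[f"p{pct}_ms"] = float(value) * _UNIT_TO_MS[unit]
    return result
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)` and the captured standard output handed to `parse_wrk_output`.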

k6: scriptable, CI-friendly load testing

K6 is a modern, scriptable load testing tool designed for integration with CI/CD pipelines and cloud-based load-generation services.[9] It uses JavaScript test scripts to define scenarios, checks, thresholds, and custom metrics, and can run locally or in distributed fashion. K6’s documentation on running large tests provides detailed guidance on maximizing load generation, emphasizing OS tuning, monitoring of CPU, memory, and network usage, and efficient scripting.[9] It advises keeping CPU utilization for k6 below about eighty percent and memory utilization below ninety percent, and recommends disabling swap to avoid performance degradation.[9]

K6 supports features such as streaming results to the cloud, discardResponseBodies to save memory, and optional suppression of local thresholds and summaries when streaming, as these operations would otherwise be duplicated.[9] It notes that features like checks, groups, custom metrics, and thresholds with abort-on-fail entail additional computation and can reduce achieved load; for maximal load generation, these should be used sparingly.[9] K6 also supports execution segmentation, enabling load to be split across multiple k6 instances, which is useful for very large-scale tests.[9]

For orchestrator benchmarking, k6 is particularly valuable when test scenarios need to model realistic user flows involving multiple requests, authentication, or stateful interactions. Its integration with CI pipelines makes it a natural choice for automated regression testing of orchestrator configurations. The harness can maintain separate k6 scripts for each workload, and k6 metrics such as response-time percentiles and request rates can be exported to Prometheus or other monitoring systems for unified analysis. When using k6, one must monitor the load-generator’s resources, as recommended, to ensure the orchestrator is the bottleneck, not k6 itself.[9]

Fortio: QPS-targeted testing with rich histograms

Fortio is a load testing library, command-line tool, and web UI focused on microservices and HTTP/gRPC workloads, originally developed as Istio’s load testing tool and later generalized.[10] It runs at a specified target queries per second and records a histogram of execution times, from which it computes percentiles such as p99.[10] Fortio can run for a fixed duration, fixed number of calls, or until interrupted, at either constant target QPS or maximum speed per connection and thread.[10] It is implemented in Go and packaged as a small Docker image with minimal dependencies, and its server component provides a simple web UI and REST API to trigger runs and visualize results, including comparative graphs of min, max, average, QPS, and percentile latencies across multiple runs.[10]

Fortio’s emphasis on controlling QPS directly makes it suitable for experiments where one wishes to compare orchestrator behavior under specific load levels. By fixing target QPS and observing latency and error rates, one can identify the load level at which performance degrades for each orchestrator. Its histogram-based reporting is well-aligned with the focus on tail latency metrics such as p99. The ability to run Fortio as a Docker container facilitates its integration into a benchmark harness: the same Fortio image can be deployed against services running on Kubernetes and Swarm, ensuring identical load-generation software.
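
Identifying the load level at which performance degrades can be sketched as a sweep over fixed QPS targets. The step dictionaries, the latency budget, and the one-percent error ceiling below are hypothetical; the per-step figures would come from the load generator's own reports.

```python
from typing import Optional


def max_sustainable_qps(sweep: list, p99_budget_ms: float,
                        max_error_rate: float = 0.01) -> Optional[int]:
    """Highest target QPS reached before the first step that breaks the budget.

    Each step is a dict with keys target_qps, p99_ms and error_rate.
    Returns None when even the lowest step violates the budget.
    """
    best = None
    for step in sorted(sweep, key=lambda s: s["target_qps"]):
        if step["p99_ms"] > p99_budget_ms or step["error_rate"] > max_error_rate:
            break  # stop at the first violation; later passes are not trusted
        best = step["target_qps"]
    return best
```

Running the same sweep against each orchestrator and comparing the returned values gives a single, load-controlled figure per stack rather than a peak-throughput number that depends on the generator.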

Vegeta: constant-rate load testing as a library and CLI

Vegeta is a versatile HTTP load testing tool and library designed to “drill HTTP services with a constant request rate.”[11] It reads a list of targets (URLs and methods) and generates requests at a specified rate, reporting metrics such as latency distributions, success rates, and throughput.[11] Vegeta’s constant-rate focus makes it similar to Fortio in that it controls request rate explicitly, enabling precise evaluation of system behavior at defined loads. Vegeta can be integrated programmatically as a Go library or used via its command-line interface, making it suitable both for standalone tests and for embedding within custom harnesses.

In orchestrator benchmarking, Vegeta can be deployed as a load generator container, executing predefined target lists against services in Kubernetes and Swarm clusters. Its simplicity and library nature allow embedding in custom controllers that might orchestrate multiple runs and integrate results with other metrics. As with other tools, it is vital to ensure that Vegeta’s host has adequate resources so that the orchestrator and application are the limiting factors.

Kubernetes-specific scalability tools: kube-burner and ClusterLoader2

Kubernetes has specialized tools for performance and scalability testing that operate at the cluster level, stressing the API server, scheduler, and controllers. Kube-burner is a Kubernetes performance and scale test orchestration toolset written in Go that uses the official Kubernetes client-go library.[6] It can create, delete, and patch Kubernetes resources at scale; collect and index Prometheus metrics; define measurements; and set up alerting.[6] As a binary application, it orchestrates large-scale experiments, for example creating thousands of pods or deployments, and then measures cluster behavior via Prometheus metrics.[6] This makes kube-burner ideal for exercising Kubernetes control-plane scalability and for validating recommendations such as those in the Kubernetes large cluster guidelines.[20]

ClusterLoader2 is another Kubernetes load testing tool and is the official scalability and performance testing framework for Kubernetes.[7] It follows a “bring your own YAML” model: users define workloads in YAML manifests, and ClusterLoader2 deploys them at scale, monitoring cluster behavior.[7] It is used by the Kubernetes community to test scaling behavior across versions and configurations. Its strengths include tight integration with Kubernetes, strong community support, and alignment with upstream testing methodologies.

While kube-burner and ClusterLoader2 do not directly apply to Docker Swarm, they provide a rigorous way to explore Kubernetes’ scalability limits and behavior under large-scale resource churn. In an orchestrator comparison, these tools could be used to characterize Kubernetes in depth, while analogous, custom scripts would be needed to generate similar stress on Swarm’s API and control plane. The insights gained from these tools can inform the design of more comparable experiments, such as matching the rate of service and task creations in Swarm to the rate of deployments and pods created in Kubernetes.

System-level microbenchmarks: sysbench and storage benchmarking

Sysbench is a portable, multi-threaded benchmark tool capable of testing CPU, memory, and file I/O performance.[12] It can measure CPU performance by calculating prime numbers, memory performance by writing data in blocks of configurable size, and file I/O performance by reading and writing test files in various patterns.[12] Although Sysbench is typically used to benchmark database systems, its CPU and memory tests provide useful insight into raw computation and memory bandwidth on cluster nodes.[12] Running Sysbench on nodes before orchestrator deployment ensures that hardware performance is as expected and helps identify outlier nodes or transient issues.

For storage-specific benchmarking, tools like fio are recommended by platforms such as OpenShift for sizing and monitoring etcd storage.[29] Fio can generate various I/O workloads and measure latency, throughput, and IOPS, helping ensure that etcd’s storage backend meets recommended performance guidelines, such as achieving at least 50 IOPS of 8 KB sequential writes with about 10 ms latency, and for heavy clusters, 500 IOPS with about 2 ms latency.[29] Because etcd is central to Kubernetes control-plane performance, storage benchmarking is a critical prerequisite for fair evaluation.
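
The two thresholds quoted above can be encoded as a simple pre-flight check. This is a sketch only: the tier names are invented for illustration, and treating the "about 10 ms" and "about 2 ms" figures as hard ceilings is an assumption.

```python
def classify_etcd_storage(seq_write_iops: float, latency_ms: float) -> str:
    """Map a measured 8 KB sequential-write result onto the quoted guidelines."""
    if seq_write_iops >= 500 and latency_ms <= 2.0:
        return "heavy"         # meets the figure quoted for heavy clusters
    if seq_write_iops >= 50 and latency_ms <= 10.0:
        return "baseline"      # meets the minimum guideline only
    return "insufficient"      # below the minimum; investigate before benchmarking
```

A harness could refuse to start a Kubernetes run on nodes classified as insufficient, since any control-plane slowness measured there would reflect the disk rather than the orchestrator.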

In orchestrator benchmarking, these microbenchmarks are not directly used to compare Kubernetes and Swarm, but they provide baseline measurements that contextualize observed orchestration performance. If Kubernetes appears slower in scheduling than Swarm on one set of nodes but storage benchmarks reveal slower disks for etcd, the apparent performance difference may not be intrinsic to the orchestrator. Incorporating microbenchmark results into the analysis helps mitigate such misinterpretations.

Observability stack: etcd, CoreDNS, and cluster metrics

A robust observability stack is essential for orchestrator benchmarking. For Kubernetes, monitoring etcd metrics is particularly important, as etcd is the system of record for cluster state.[43] Etcd exposes metrics such as disk WAL fsync duration histograms, backend commit latencies, network peer round-trip time histograms, and proposal failure counts, which can be scraped by Prometheus.[29][43] Analyzing p99 latencies for WAL fsync and peer RTT, and monitoring the rate of proposal failures, can reveal whether etcd is under stress, suffers from storage or network issues, or experiences cluster leader instability.[29][43] Sysdig’s guide to monitoring etcd describes how to configure Prometheus to scrape etcd metrics and highlights the importance of metrics like etcd_network_peer_round_trip_time_seconds and etcd_disk_wal_fsync_duration_seconds in assessing cluster health.[43]
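
Because these etcd metrics are exposed as cumulative histograms, a p99 is estimated from bucket counts rather than read directly. The sketch below shows the linear-interpolation approach, similar in spirit to what PromQL's histogram_quantile does; it is a simplified approximation, and the input format (a list of upper-bound and cumulative-count pairs ending in an infinite bucket) is an assumption.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q (0..1) from (upper_bound, cumulative_count) pairs."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # In the +Inf bucket (or an empty one) there is nothing to
            # interpolate over, so fall back to the last finite bound.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

The estimate is only as fine as the bucket layout, so a p99 that lands in a wide bucket should be read as a range rather than a precise value.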

CoreDNS metrics, including query counts, error rates, and response latencies, provide insight into service discovery performance and can highlight misconfigurations such as infinite loops or excessive query loads.[44] Monitoring these metrics alongside application-level metrics helps determine whether DNS resolution contributes to high latencies or error rates in benchmark runs.

Kubernetes also provides a variety of control-plane and scheduler metrics, such as PodSchedulingDuration and API server request latency, which can be scraped via Prometheus and analyzed.[13] GKE’s startup latency metrics for pods and nodes illustrate how observability systems can be configured to track pod and node startup times, which can be replicated in self-managed clusters via custom metrics.[16] For Docker Swarm, observability is more limited by default, but one can instrument Docker APIs, monitor manager node resource usage, and use tools like cAdvisor and Prometheus exporters to gather container and node metrics.

An orchestrator benchmark harness should integrate these observability components, scraping metrics during experiments and storing them alongside application-level metrics and load-generator outputs. This unified dataset enables comprehensive analysis and debugging and supports the identification of bottlenecks and unexpected behaviors.

Common Pitfalls and Sources of Bias in Orchestrator Comparisons

Benchmarking research emphasizes that benchmarks are often incomplete and biased, especially when they focus solely on “data-path” performance while ignoring control operations and management overhead.[42] Container orchestrator comparisons are susceptible to a variety of pitfalls that can invalidate results if not carefully controlled. One common pitfall is configuration asymmetry: deploying Kubernetes and Docker Swarm with different resource allocations, network settings, or security configurations, then attributing performance differences to the orchestrator rather than to configuration choices. For example, running Kubernetes control-plane components on modest VMs with network-attached storage for etcd while running Swarm managers on larger VMs with local SSDs biases the comparison.[29] Similarly, enabling network encryption or more complex CNI plugins in Kubernetes while using simple overlays in Swarm introduces overhead differences that must be acknowledged.[39]

Networking configuration is a particular source of bias. In Docker Swarm, using overlay networks with VIP-based service endpoints imposes a known ten to thirty percent performance penalty compared to host networking, especially for latency-sensitive workloads.[39] Choosing VIP mode for Swarm while using host networking or optimized CNI configurations in Kubernetes biases results against Swarm. Conversely, configuring Kubernetes with heavyweight CNI plugins or IPVS modes that introduce additional overhead while Swarm uses minimal overlays would bias results in the opposite direction. Fair comparisons must therefore either harmonize networking modes or, at a minimum, test equivalent configurations for both orchestrators, such as using overlay networking with load-balancing in both.

Another pitfall is ignoring load-generator limits. K6 documentation warns that if k6 uses one hundred percent CPU to generate load, tests experience throttling and reported performance may mistakenly reflect load-generator limitations rather than system-under-test behavior.[9] The same applies to wrk, Fortio, and Vegeta: if the load generator saturates CPU, memory, or NIC, it cannot increase load further, and any observed plateau in throughput may simply be the generator’s ceiling. This is especially insidious when comparing orchestrators, since one might mistakenly conclude that both orchestrators saturate at the same throughput, while in fact the load generator is the limiting factor in both cases. Monitoring the load-generator’s resource utilization is therefore essential.

Cache effects and warm-up also introduce bias if not handled consistently. Measuring Kubernetes after a warm-up period with pre-populated caches, while measuring Swarm immediately after startup with cold caches, will overstate Kubernetes performance. The USENIX cache warm-up study demonstrates that warm-up behavior is highly workload-specific and must be measured explicitly.[25] Benchmarks must either compare cold-start behaviors for both orchestrators or compare warm-state behaviors for both, with explicit warm-up phases and steady-state measurements in each case.[25][34]

A subtler source of bias arises from differences in feature sets. Kubernetes offers more advanced scheduling and autoscaling features, which can be configured to optimize performance at the cost of complexity, while Swarm emphasizes simplicity.[1][3][15][21] If one configures Kubernetes with HPA and custom metrics optimized for the workload, but operates Swarm with a naïve static replica count, the comparison is no longer apples-to-apples. Conversely, if one uses only the simplest Kubernetes features to match Swarm’s configuration, the comparison may underrepresent Kubernetes’ capabilities in practice. To mitigate this, one should define comparison scenarios transparently. In “equal-footing” scenarios, both orchestrators should be configured with comparable simplicity (for example, static replica counts) to isolate baseline performance. In “feature-enabled” scenarios, each orchestrator can be tuned using its native capabilities to optimize performance, with the goal of comparing “maximally tuned” behavior.

Benchmarking in shared environments also risks interference from other workloads, as discussed previously. If Kubernetes and Swarm clusters share the same physical infrastructure and experiments are run sequentially without cleaning up or rebalancing, lingering effects such as cached data, warmed CPU caches, or network congestion may affect later runs. Proper experimental design randomizes run order, allows systems to return to equilibrium between runs, and uses isolation where possible.
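
Randomizing run order is straightforward to make reproducible. The following sketch shuffles the stack order independently for each repetition using a seeded generator; the function name and the stack labels in the usage note are illustrative.

```python
import random


def run_schedule(stacks: list, repetitions: int, seed: int) -> list:
    """Return one independently shuffled stack order per repetition."""
    rng = random.Random(seed)  # seeded so the schedule can be reproduced later
    schedule = []
    for _ in range(repetitions):
        order = list(stacks)   # copy so the caller's list is left untouched
        rng.shuffle(order)
        schedule.append(order)
    return schedule
```

For example, `run_schedule(["kubernetes", "k3s", "swarm"], repetitions=5, seed=42)` yields five orderings. Recording the seed alongside the results lets a later reader confirm that no stack systematically ran first or last.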

Finally, measurement and analysis errors can bias conclusions. Reporting only average latency hides tail behaviors; using insufficient sample sizes destabilizes p99 estimates; aggregating metrics over excessively long windows obscures temporal patterns; and failing to account for variance and statistical significance can lead to overinterpretation of small differences. The p99 latency article underscores the importance of interpreting p99 alongside p50 and considering the time window of measurement.[26] Studies such as the BlueField DPU benchmarking thesis demonstrate the value of reporting confidence intervals and p-values to support claims of performance improvements.[30] Orchestrator comparisons should adopt similar statistical rigor, resisting the temptation to cherry-pick favorable runs.
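
The variance and confidence-interval reporting recommended above can be sketched with the standard library alone. The table holds two-sided 95% critical values of Student's t for up to ten samples; falling back to the normal value 1.96 for larger samples is a simplifying assumption that makes intervals slightly too narrow for moderate sample sizes.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
_T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_ci95(samples: list) -> tuple:
    """Return (mean, lower, upper) for a 95% confidence interval of the mean."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate variance")
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)           # sample standard deviation
    t = _T95.get(n - 1, 1.96)                   # normal approximation for n > 10
    half_width = t * stdev / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True when two (mean, lower, upper) intervals share any point."""
    return a[1] <= b[2] and b[1] <= a[2]
```

Overlapping intervals are a conservative signal that a difference should not be claimed; a formal significance test on the raw samples remains the stronger check when intervals are close.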

Comparative Studies of Kubernetes, Docker Swarm, and Other Orchestrators

Academic comparisons of container orchestration engines

Several academic works have conducted systematic comparisons of container orchestration engines, focusing on functional capabilities, performance, or both. A thorough functional and performance comparison of container orchestration engines examines their scheduling policies, scalability, resiliency, and overheads across various workloads and environments, often including Kubernetes, Docker Swarm, and others.[2] Such studies typically evaluate metrics like deployment times, response latency, resource utilization, and scaling behavior, and highlight trade-offs such as Kubernetes’ rich scheduling semantics and network policies versus Swarm’s simplicity and lighter control-plane footprint.[2][31]

A comparative study published in IJRAH aims to provide an unbiased comparison of container orchestration engines, focusing on technical capabilities while controlling for non-technical factors.[31] This study compares Kubernetes, Docker Swarm, and other engines across dimensions such as cluster management, service discovery, scaling, load balancing, and fault tolerance, and may include performance measurements for specific workloads.[31] Its methodology emphasizes holding constant the workload and environment while varying orchestrators, aligning with the principles discussed earlier.

Literature reviews specifically on testing container orchestration systems analyze published work and identify key testing objectives, including functionality, resiliency, performance, security, and observability, and note a lack of standardized benchmarks and tools for comprehensive evaluation.[5] They also mention performance comparisons of cloud-based container orchestration tools, highlighting that many existing studies focus on specific use cases or environments and that broader, more systematic benchmarking frameworks are needed.[5][24]

Work on performance evaluation of container orchestration tools in edge computing environments examines how orchestrators perform under constrained resources, heterogeneous nodes, and dynamic network conditions.[37] These studies often compare Kubernetes-based solutions with other orchestration frameworks, assessing scheduling delays, response latency, and resource efficiency in edge scenarios. They generally find that Kubernetes and its distributions can provide effective scheduling across resources at the edge but also highlight challenges in adapting Kubernetes’ relatively heavyweight control plane to highly constrained environments.[37]

COFFEE, a systematic benchmarking framework for container orchestration frameworks, provides a structured approach for analyzing and comparing orchestrators such as Kubernetes and Nomad in both self-hosted and cloud environments.[24] It automates experiment execution, metric collection, and analysis, and has been used to derive insights into how different orchestrators handle scaling and fault scenarios.[24] COFFEE’s design principles—automation, reproducibility, workload configurability, and comprehensive metric collection—align closely with the requirements for a benchmark harness comparing Kubernetes and Docker Swarm.

Industry analyses and practitioner reports on Kubernetes vs Docker Swarm

Industry blogs and whitepapers often compare Kubernetes and Docker Swarm from an operational and feature perspective, with occasional performance claims. IBM’s comparison describes Kubernetes as a portable, open-source platform for managing containers and their complex production workloads and scalability, with built-in horizontal autoscaling, while Swarm is characterized as Docker’s native support for orchestrating clusters of Docker engines, emphasizing quick scaling and simplicity.[3] IBM notes that both platforms allow managing containers and scaling deployments, and that their differences center around complexity and ecosystem maturity.[3]

Wallarm’s comparison emphasizes that Kubernetes, despite its more complex structure, excels in scalability and is better suited to larger projects, whereas Docker Swarm follows a more adaptable model that supports scalability while reducing the risk of single points of failure.[1] These characterizations highlight Kubernetes’ strength in large-scale, multi-tenant environments and Swarm’s appeal for simpler, smaller deployments.

Platform9 and SUSE’s Rancher blog posts compare Kubernetes with Docker Swarm in terms of cluster architecture, service discovery, scaling, and load balancing, generally concluding that Kubernetes offers more advanced features and flexibility, while Swarm provides a gentler learning curve and tight integration with Docker.[21][15] They mention auto-scaling capabilities, robustness of networking and storage integrations, and community support as differentiators.[21][15] These analyses, while not deeply quantitative in performance terms, set expectations: Kubernetes is expected to be heavier but more powerful; Swarm is expected to be lighter and easier but potentially less scalable in complex scenarios.

Practitioner reports on Docker Swarm networking, such as GitHub discussions about overlay network latency, provide valuable empirical insights. One such discussion notes increased latency in latency-sensitive applications with high request rates when using overlay networks and VIP-based service endpoints in Swarm, estimating a ten to thirty percent performance hit due to overlay and VIP overhead.[39] It suggests using DNS round robin endpoint mode for single-backend tasks to avoid extra hops through VIP/IPVS, upgrading to instances with faster networking, avoiding encrypted overlays, and considering host or bridge networking when overlay networks cannot meet latency requirements.[39] These recommendations illustrate both the performance implications of network configuration and the tuning required for Swarm in high-performance scenarios.

Guides on Kubernetes mistakes and best practices often mention performance pitfalls, such as overloading clusters, duplicating deployment strategies, misconfiguring resource requests and limits, and neglecting observability.[32] These are indirectly relevant to benchmarking: misconfigured Kubernetes clusters may underperform, leading to unfair comparisons with Swarm. Similarly, performance testing in CI pipelines, as described by SpeedCurve, emphasizes the need for realistic integration environments, warmed caches, and isolated performance tests to avoid resource contention and inaccurate results.[34] These ideas can be adapted to orchestrator benchmarking.

Lessons for a benchmark harness from existing studies

The accumulated body of academic and industrial work on container orchestration systems suggests several lessons for designing a benchmark harness comparing Kubernetes and Docker Swarm. First, workloads must be carefully chosen and standardized, reflecting both microbenchmarks and real-world application patterns.[2][23][24][37] Second, automation and reproducibility are paramount; frameworks like COFFEE and hermetically sealed benchmark containers illustrate how to encapsulate complexity in reusable components.[24][33] Third, observability must be integrated from the outset, including application-level metrics, orchestrator control-plane metrics, and infrastructure metrics such as storage and network performance.[29][43][44]

Fourth, fairness requires controlling for configuration differences and bias sources such as networking modes, image-pull behavior, cache state, and autoscaling policies.[19][25][39] Fifth, statistical rigor is essential: experiments must be replicated, variance measured, and significance tested, as demonstrated in the BlueField DPU studies and scheduling algorithm analyses.[30][38] Finally, benchmarking is inherently hard, and no single experiment can capture all aspects of orchestrator performance; harnesses should be designed to support multiple scenarios and evolving workloads, enabling iterative refinement and deeper understanding over time.[33][42]

Designing an Automated, Reproducible Benchmark Harness

Architectural overview of the harness

A robust benchmark harness for comparing Kubernetes and Docker Swarm consists of several logical components: the orchestrator under test, the benchmark workloads, load-generation engines, an observability and metrics collection system, and a harness controller that orchestrates experiments. The orchestrator component includes either a Kubernetes or Swarm cluster deployed on identical hardware, configured via infrastructure-as-code scripts to ensure reproducibility. The benchmark workloads are packaged as container images that can be deployed identically on both orchestrators, with configuration and resource specifications managed centrally.

Load-generation engines such as wrk, k6, Fortio, or Vegeta run on separate nodes, possibly encapsulated in their own containers, and are orchestrated by the harness controller to apply specified load profiles to workloads.[8][9][10][11] The observability system comprises Prometheus or equivalent metrics collectors, log aggregation (for example, via Elasticsearch or Loki), and dashboards (Grafana or vendor tools) configured to scrape and store orchestrator metrics (such as etcd, scheduler, CoreDNS, Swarm manager stats), node metrics (CPU, memory, network), and application metrics.[29][43][44]

The harness controller is the glue that defines experiments, provisions clusters, deploys workloads, configures load generators, triggers warm-up and measurement phases, collects metrics and logs, and performs initial analysis. It can be implemented as scripts, a dedicated service, or integrated into CI pipelines. To ensure fairness, the controller must enforce consistent configurations across orchestrators and randomize the order of runs when needed. Incorporating lessons from frameworks like COFFEE and hermetically sealed benchmark containers, the harness should treat benchmark workflows as reusable, parameterizable entities, with clear separation between orchestrator-specific modules and shared components.[24][33]

Experiment workflow: provisioning to report generation

A typical experiment workflow in the harness proceeds through multiple stages. First, the harness provisions the infrastructure for the orchestrator under test, either by launching VMs in a cloud environment or allocating bare-metal servers, and configures operating systems and storage as specified. Sysbench and fio microbenchmarks may be run at this stage to verify hardware performance matches expectations.[12][29] Second, the orchestrator—either Kubernetes or Swarm—is installed and configured according to predefined templates, with control-plane and worker nodes set up on identical hardware across orchestrators.

Third, the harness deploys the benchmark workloads using orchestrator-specific manifests (Kubernetes YAML, Swarm stack files) generated from a common template that ensures identical images, environment variables, and resource reservations. Any autoscaling policies or health checks are set equivalently. Fourth, the harness configures the observability stack, ensuring metrics scraping from nodes, orchestrator components, and application endpoints is functioning. This may involve deploying Prometheus, setting up scrape configurations for etcd, CoreDNS, kube-scheduler, and node exporters in Kubernetes, and analogous exporters in Swarm.[29][43][44]
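
Generating both manifests from one specification might be sketched as follows. The spec keys are hypothetical, only resource limits are mapped (reservations would follow the same pattern), and the assumption that a Compose memory value such as `256M` is interpreted in binary units, matching Kubernetes `256Mi`, should be verified against the Docker version in use.

```python
def kubernetes_deployment(spec: dict) -> dict:
    """Render a minimal apps/v1 Deployment from the shared workload spec."""
    labels = {"app": spec["name"]}
    container = {
        "name": spec["name"],
        "image": spec["image"],
        "env": [{"name": k, "value": v} for k, v in sorted(spec["env"].items())],
        "resources": {"limits": {"cpu": f"{spec['cpu_millicores']}m",
                                 "memory": f"{spec['memory_mib']}Mi"}},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": spec["name"]},
        "spec": {"replicas": spec["replicas"],
                 "selector": {"matchLabels": labels},
                 "template": {"metadata": {"labels": labels},
                              "spec": {"containers": [container]}}},
    }


def swarm_stack(spec: dict) -> dict:
    """Render the equivalent Compose-format stack for `docker stack deploy`."""
    limits = {"cpus": f"{spec['cpu_millicores'] / 1000:g}",  # 500 -> "0.5"
              "memory": f"{spec['memory_mib']}M"}
    service = {"image": spec["image"],
               "environment": dict(sorted(spec["env"].items())),
               "deploy": {"replicas": spec["replicas"],
                          "resources": {"limits": limits}}}
    return {"services": {spec["name"]: service}}
```

Both functions return plain dictionaries that can be serialized to YAML. Keeping the image reference, environment, replica count, and limits in a single spec means a difference between the two manifests can only come from the rendering code, which is small enough to review.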

Fifth, the harness defines and initiates a warm-up phase, during which load generators apply test traffic to workloads at a level sufficient to warm caches and stabilize performance metrics.[25][34] Metrics collected during warm-up are stored but not used for steady-state comparisons, except for analyses of cold-start behavior. Sixth, after monitoring indicates stable performance, the harness begins the measurement phase, running load generators with specified profiles for a defined duration or number of requests while collecting metrics and logs.
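
One simple way to decide that performance has stabilized is to watch the coefficient of variation of a recent window of throughput samples. The window size of five and the five-percent tolerance below are illustrative assumptions that would need tuning per workload.

```python
import statistics


def is_steady(samples: list, window: int = 5, max_cv: float = 0.05) -> bool:
    """True when the last `window` samples vary by at most max_cv around their mean."""
    if len(samples) < window:
        return False            # not enough data to judge stability yet
    recent = samples[-window:]
    mean = statistics.fmean(recent)
    if mean == 0:
        return False            # no traffic served; cannot be called steady
    cv = statistics.pstdev(recent) / mean   # coefficient of variation
    return cv <= max_cv
```

The harness would poll this after each sampling interval during warm-up and start the measurement phase on the first `True`, ideally with an upper time limit so that a workload that never stabilizes is reported rather than waited on forever.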

Seventh, upon completion of the measurement phase, the harness tears down workloads and orchestrator resources as needed, archiving logs and metrics. It then processes collected data, computing throughput, p50/p95/p99 latency, control-plane latency metrics, resource overhead, scale-out times, and failover times, as appropriate for the experiment. Statistical calculations, including sample means, variances, confidence intervals, and significance tests, can be performed either within the harness or offline using analysis tools.[26][30][38] Finally, the harness generates reports summarizing results for the current orchestrator, and, when runs for both orchestrators are complete, comparative reports that highlight differences and their statistical significance.
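The per-stack statistics named above can be computed with the standard library alone. The following is a minimal sketch under stated assumptions: the function names are illustrative, the critical values are the standard two-sided 95% Student-t values for 1 to 10 degrees of freedom, and larger samples fall back to the normal value 1.96, which makes the interval slightly too narrow for moderate sample sizes.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student-t critical values for 1..10 degrees of freedom.
T_95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228]


def t_critical(df):
    """Table lookup for small samples; normal approximation beyond the table."""
    return T_95[df - 1] if 1 <= df <= len(T_95) else 1.96


def summarize(samples):
    """Mean, sample standard deviation, and 95% confidence interval of the mean."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples for a confidence interval")
    m = mean(samples)
    s = stdev(samples)
    half_width = t_critical(n - 1) * s / sqrt(n)
    return {"n": n, "mean": m, "stdev": s, "ci95": (m - half_width, m + half_width)}


def intervals_overlap(a, b):
    """True when two summaries have overlapping 95% intervals."""
    return a["ci95"][0] <= b["ci95"][1] and b["ci95"][0] <= a["ci95"][1]
```

Overlapping intervals are a conservative signal of a statistical tie. Intervals that do not overlap suggest a real difference, but a formal claim of significance would still call for a two-sample test such as Welch's t-test.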

Implementing fairness controls in the harness

Ensuring fairness across orchestrator runs requires explicit controls within the harness. Configuration management must guarantee that hardware allocations, OS versions, kernel parameters, and container runtime versions are identical across Kubernetes and Swarm clusters. Workload deployment must use a single source of truth for container images and configurations, with orchestrator-specific manifests generated from parameterized templates that enforce identical settings. Resource reservations and limits must map semantically across orchestrators, and networking settings should be harmonized as much as possible, recognizing differences in implementation.

Cache state control is another fairness requirement. Before each run, the harness must prepare cache conditions, such as clearing or warming image caches, database caches, and HTTP caches, according to the experiment definition. For cold-start experiments, scripts may remove relevant images from nodes and restart services to clear caches; for warm-start experiments, a pre-defined warm-up load is applied until metrics stabilize. The harness should record cache-related events and metrics, enabling retrospective verification that cache conditions were as intended.[19][25][34]

Autoscaling and failure handling policies must also be aligned. If Kubernetes uses HPA policies based on CPU utilization to scale workloads, Swarm should be equipped with an external autoscaler or scripted mechanism that mimics similar behavior; alternatively, autoscaling should be disabled for both and static scaling used, depending on the experiment’s goals.[17][21][15] Node failure detection timeouts, such as Kubernetes pod-eviction-timeout (superseded in current releases by taint-based eviction, where tolerationSeconds on the not-ready and unreachable taints sets the delay) and Swarm’s heartbeat periods, should be documented and, where feasible, configured to similar values to enable fair failover comparisons.[18][45] For failure injection experiments, the harness must apply identical procedures (such as stopping nodes or introducing network partitions) at the same logical points in time.

Randomization of run order and adequate spacing between runs help avoid bias due to residual system state or time-varying environmental conditions. The harness can randomize whether Kubernetes or Swarm is tested first for a given scenario, and it can ensure that clusters are re-provisioned or at least reset between runs. Logging of all configuration parameters, environment variables, and system timestamps allows auditing and debugging of fairness issues.
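Randomized run order is easiest to audit when it is derived from a recorded seed. The sketch below assumes a hypothetical `run_schedule` helper and illustrative stack names; each repetition gets its own independent shuffle, and logging the seed lets the exact order be reproduced later.

```python
import random


def run_schedule(stacks, repetitions, seed):
    """Return one independently shuffled stack order per repetition."""
    rng = random.Random(seed)  # private generator, so the order depends only on the seed
    schedule = []
    for _ in range(repetitions):
        order = list(stacks)   # copy, so the caller's list is not modified
        rng.shuffle(order)
        schedule.append(order)
    return schedule


if __name__ == "__main__":
    stacks = ["jaco", "kubernetes", "k3s", "swarm"]
    for rep, order in enumerate(run_schedule(stacks, repetitions=3, seed=20240101), 1):
        print(f"repetition {rep}: {' -> '.join(order)}")
```

A plain shuffle does not guarantee that each stack occupies each position equally often over a small number of repetitions; a Latin-square design would, at the cost of constraining the repetition count.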

Integrating the harness with CI/CD and performance budgets

Integrating the orchestrator benchmark harness with CI/CD systems enables continuous monitoring of orchestrator performance and detection of regressions. SpeedCurve’s approach to performance testing in CI demonstrates how performance budgets can be defined and enforced so that builds fail when performance falls outside the budget.[34] For orchestrator benchmarking, similar budgets can be defined: for example, requiring that p99 latency for a reference workload remain below a threshold, or that scale-out time for a standard scenario remain within a band around baseline values.

The harness can expose a CLI or API that CI pipelines invoke upon changes to orchestrator configurations, cluster images, or workload definitions. The pipeline may spin up a small test cluster, run a subset of benchmark scenarios, and parse results to determine whether budgets are met. Depending on organizational maturity and resource constraints, performance tests may either block the build on failures, report regressions without blocking, or merely log results for periodic review, mirroring the options described for SpeedCurve integration.[34] Early in adoption, teams may choose to log performance results without enforcement, then move toward stricter enforcement as confidence in the harness grows.
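A budget check of this kind reduces to comparing measured values against limits and mapping the outcome to an exit code. The sketch below is illustrative only: the metric names, the `block` / `report` / `log` mode names, and the helper functions are assumptions, and every metric is treated as lower-is-better.

```python
def band_limit(baseline, tolerance):
    """Upper edge of a band around a baseline (tolerance 0.2 allows 20% above it)."""
    return baseline * (1.0 + tolerance)


def check_budgets(results, budgets):
    """Return a human-readable violation for each metric over its limit."""
    violations = []
    for metric, limit in sorted(budgets.items()):
        value = results.get(metric)
        if value is None:
            violations.append(f"{metric}: no measurement recorded")
        elif value > limit:
            violations.append(f"{metric}: {value} exceeds budget {limit}")
    return violations


def exit_code(violations, mode="block"):
    """Only 'block' mode fails the pipeline; 'report' and 'log' always pass."""
    if mode not in ("block", "report", "log"):
        raise ValueError(f"unknown mode: {mode}")
    return 1 if violations and mode == "block" else 0


if __name__ == "__main__":
    budgets = {"p99_latency_ms": 250.0, "scale_out_s": band_limit(10.0, 0.2)}
    results = {"p99_latency_ms": 240.0, "scale_out_s": 13.5}  # made-up numbers
    problems = check_budgets(results, budgets)
    for line in problems:
        print(line)
    raise SystemExit(exit_code(problems, mode="block"))
```

Treating a missing measurement as a violation is a deliberate choice here: a run that silently fails to record a metric should not pass a gate.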

Automated performance testing in CI must balance thoroughness and speed. Full orchestrator benchmarking across all workloads and scenarios may be too slow for every commit, but nightly or weekly runs can catch regressions while per-commit pipelines run lighter smoke tests. The harness should support different profiles for “fast” and “full” runs, allowing CI to choose appropriately. Over time, the harness can be extended to support additional orchestrators, workloads, and metrics, turning it into a platform for ongoing performance governance.

Extensibility to other orchestrators and workloads

While this report focuses on Kubernetes and Docker Swarm, the benchmark harness should be designed to support additional orchestrators, such as HashiCorp Nomad, AWS ECS, or managed Kubernetes distributions, without fundamental redesign.[14][20][35] COFFEE illustrates how a framework can support multiple orchestrators via modular drivers that map generic workload and metric definitions to orchestrator-specific constructs.[24] Similarly, the harness can define an abstract orchestration interface that describes operations such as deploying a service, scaling replicas, inducing node failures, and collecting metrics, with concrete implementations for each supported orchestrator.
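The abstract orchestration interface might look like the sketch below. It is an assumption about structure, not COFFEE's design or this harness's code: the class and method names are hypothetical, only deploy and scale are shown, and the drivers return the command they would run rather than executing it, which keeps the example free of side effects. Failure injection and metric collection would be added as further abstract methods in the same way.

```python
from abc import ABC, abstractmethod


class OrchestratorDriver(ABC):
    """Operations a benchmark scenario needs, independent of the orchestrator."""

    @abstractmethod
    def deploy_cmd(self, manifest_path):
        """Command that deploys the workload described by `manifest_path`."""

    @abstractmethod
    def scale_cmd(self, service, replicas):
        """Command that scales `service` to `replicas` instances."""


class KubernetesDriver(OrchestratorDriver):
    def deploy_cmd(self, manifest_path):
        return ["kubectl", "apply", "-f", manifest_path]

    def scale_cmd(self, service, replicas):
        return ["kubectl", "scale", f"deployment/{service}", f"--replicas={replicas}"]


class SwarmDriver(OrchestratorDriver):
    def __init__(self, stack="bench"):
        self.stack = stack  # stack name; Swarm prefixes service names with it

    def deploy_cmd(self, manifest_path):
        return ["docker", "stack", "deploy", "-c", manifest_path, self.stack]

    def scale_cmd(self, service, replicas):
        return ["docker", "service", "scale", f"{self.stack}_{service}={replicas}"]


def scale_out_plan(driver, manifest_path, service, replicas):
    """Shared scenario logic: the same steps, whichever driver is supplied."""
    return [driver.deploy_cmd(manifest_path), driver.scale_cmd(service, replicas)]
```

Because the scenario function only sees the interface, supporting another orchestrator means adding one driver class; the scenario definitions and the analysis code are not touched.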

Workload definitions should be agnostic to orchestrators, relying on container images and metadata that can be translated into Kubernetes deployments, Swarm stacks, or Nomad jobs. As new workloads emerge, such as LLM serving systems or edge computing frameworks, they can be added as additional benchmark suites, leveraging the same load-generation and observability infrastructure.[23][36][37] This extensibility ensures that investments in harness development yield long-term value and can adapt to evolving technology landscapes.

Conclusion

Fairly comparing container orchestrators such as Kubernetes and Docker Swarm on identical hardware is both technically challenging and practically important. Orchestrators occupy a central role in modern cloud-native systems, and their performance characteristics influence everything from time-to-market to user experience and infrastructure cost. This report has argued that orchestrator performance must be viewed as a multidimensional construct, encompassing application throughput and latency (including tail percentiles like p99), control-plane and scheduling latency, pod and container startup time, resource overhead at the control-plane and node levels, horizontal scaling behavior, node-failure recovery times, networking and service-discovery overhead, pod density and cluster scale, and image-pull behavior and caching.[1][3][17][20][21][15][26][29][39]

Designing unbiased experiments to measure these dimensions requires careful control of hardware and software substrates, identical workload implementations and container images, consistent resource configurations, clear warm-up and steady-state phases with controlled cache conditions, adequate repetitions and statistical treatment, and robust observability.[5][24][25][29][33][42] Load-generation tools such as wrk, k6, Fortio, and Vegeta provide the means to drive HTTP workloads and measure throughput and latency, while Kubernetes-specific tools such as kube-burner and ClusterLoader2 stress-test scalability and control-plane behavior.[6][7][8][9][10][11] System-level benchmarks like sysbench and fio ground orchestrator performance in hardware realities, and observability stacks leveraging Prometheus, etcd metrics, and CoreDNS metrics reveal internal behaviors affecting performance.[12][29][43][44]

Common pitfalls, including configuration asymmetries, network-mode mismatches, load-generator saturation, unbalanced caching, and inadequate statistical rigor, can severely bias conclusions if left unaddressed.[25][34][39][42] Academic and industrial studies on container orchestration systems, from functional and performance comparisons to edge evaluations and systematic benchmarking frameworks like COFFEE, provide valuable methodological lessons and underscore the need for automation, reproducibility, and comprehensive metric collection.[2][5][24][31][37] Industry analyses of Kubernetes and Docker Swarm, as well as practitioner reports on networking and operational issues, contextualize performance expectations and highlight practical tuning considerations.[1][3][21][15][32][39]

Building an automated, reproducible benchmark harness that embodies these best practices is a substantial engineering undertaking, but one that yields durable benefits. Such a harness enables organizations to make evidence-based decisions about orchestrator selection, tune configurations for specific workloads, detect performance regressions in CI/CD pipelines, and systematically explore trade-offs across scenarios. By adhering to principles of fairness, transparency, and statistical rigor, and by leveraging the rich ecosystem of load-testing, observability, and orchestration tools available, practitioners can move beyond anecdotes and unstructured experiments toward scientifically grounded comparisons of Kubernetes, Docker Swarm, and future container orchestration technologies.