In today's dynamic computing landscape, organizations not only demand immense processing power but also require seamless orchestration of distributed resources. While cloud platforms provide the essential hardware, the true challenge lies in managing distributed workloads: coordinating task scheduling, ensuring efficient data transfers, and maintaining robust fault tolerance across diverse infrastructures.
ArmoniK meets these challenges head-on by abstracting the underlying complexities of distributed computing. By automating critical processes such as task scheduling, resource allocation, and failure recovery, it streamlines both the development and deployment of applications. This approach allows developers to concentrate on core functionality and innovation, ultimately leading to optimized performance and scalability in modern computing environments.
In this article, we explore ArmoniK's performance and scalability, particularly its integration with AWS. We'll cover key features like task graph distribution, language flexibility, fault tolerance, and elastic scaling. Our benchmarking will showcase ArmoniK's efficient resource utilization, achieving high Worker utilization and throughput under significant loads, including tests with up to 2,000 task handlers processing 10 million tasks.
Additionally, we'll discuss how ArmoniK's integration with AWS services enhances performance, scalability, and reliability. By the end, you'll understand ArmoniK's capabilities and its potential to optimize high-performance computing in the cloud.
With the challenges clearly outlined and ArmoniK introduced as the solution, let's explore what makes this platform tick.
ArmoniK is an open-source platform designed to address the growing demands of high-performance and high-throughput computing across heterogeneous infrastructures [1]. As a computation orchestrator, ArmoniK simplifies the development and deployment of distributed applications by abstracting away the complexities of task scheduling, data management, and fault tolerance [2].
At its core, ArmoniK allows developers to focus on creating their applications without worrying about the underlying distributed execution details. The platform automatically handles task distribution, resource allocation, and failure recovery, ensuring reliable performance even in dynamic environments.
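This programming model is easiest to see in code. The sketch below is not the ArmoniK SDK; it uses Python's standard `concurrent.futures` as a local stand-in to illustrate the idea the platform generalizes to a cluster: the developer writes plain task logic, and an executor owns scheduling, distribution, and retries.

```python
from concurrent.futures import ProcessPoolExecutor

def price_option(spot: float, strike: float) -> float:
    """Plain application logic: no scheduling, storage, or retry code."""
    return max(spot - strike, 0.0)

if __name__ == "__main__":
    # A local stand-in for what ArmoniK does at cluster scale:
    # the executor decides where and when each task runs.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(price_option, [100.0, 105.0, 98.0], [100.0] * 3))
    print(results)  # [0.0, 5.0, 0.0]
```

With ArmoniK, the same separation holds, except that the "pool" spans many machines and the platform also persists inputs and outputs and recovers from worker failures.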
Now that we have an overview of its purpose, let's dive into the specific features that empower ArmoniK.
By handling these functions, ArmoniK empowers developers to concentrate on application logic while the platform manages the complexities of distributed computing, including resource allocation and data management.
These features illustrate the versatility of ArmoniK. Up next, we'll clarify some of the technical terms used throughout the article.
Now that we're on the same page with the terminology, let's see how the performance of ArmoniK is rigorously evaluated.
We developed a comprehensive testing framework that simulates real-world computational workloads. Our evaluation focuses on three performance indicators:
The initial configuration uses 120 workers (each with 1 vCPU and 1 GiB of RAM) and 16 submitting clients. Workloads are adjusted dynamically to simulate scenarios from 1 million to 10 million tasks. By varying one parameter at a time, either the workload size or the number of workers, we isolate and analyze the impact on overall performance. This ensures a controlled distribution of computational load while maintaining a realistic execution environment.
The testing strategy begins with setting up the environment and allocating the required workers. Once the system is configured, the benchmark is initiated by launching parallel jobs, starting with 1 million tasks evenly distributed among the 16 clients. As the test progresses, the workload is gradually increased by modifying the number of tasks assigned per client while scaling the number of workers accordingly. The evaluation encompasses the following scenarios, designed to systematically analyze the impact of varying workloads and worker counts:
| Number of workers | Number of tasks (millions) |
|---|---|
| 120 | 1 |
| 250 | 3 |
| 500 | 5 |
| 1000 | 7 |
| 2000 | 10 |
To ensure precise analysis, if system limitations arise, only one variable, either the workload or the number of workers, is increased at a time. This approach allows for the isolated examination of each variable's impact on system performance.
Throughout the benchmarking process, performance metrics are continuously recorded, including throughput and resource utilization. The collected data is then analyzed to assess how efficiently ArmoniK handles varying workloads and scales under increasing demand. This methodology provides a structured and comprehensive evaluation of ArmoniK's capabilities across different computational scenarios.
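To make the methodology concrete, here is a minimal sketch of the driver logic just described. It is not our actual benchmark harness: task execution is a placeholder and the 16 clients are threads rather than separate machines, but the structure (even partitioning across clients, parallel submission, wall-clock throughput measurement) is the same.

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_CLIENTS = 16

def run_client(num_tasks: int) -> int:
    # Stand-in for one client submitting its share of tasks
    # and waiting for the results to come back.
    for _ in range(num_tasks):
        pass  # placeholder for submit + wait
    return num_tasks

def run_benchmark(total_tasks: int) -> float:
    per_client = total_tasks // NUM_CLIENTS   # even split among the clients
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=NUM_CLIENTS) as clients:
        completed = sum(clients.map(run_client, [per_client] * NUM_CLIENTS))
    return completed / (time.perf_counter() - start)  # tasks per second

print(f"{run_benchmark(1_000_000):,.0f} tasks/s")
```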
This methodology lays the groundwork for our next discussion: understanding the types of tasks used to simulate different conditions.
To represent diverse real-world conditions, we designed task profiles that include:
With these profiles established, it's time to explore how the system manages and schedules these tasks.
In containerized environments, ArmoniK's scheduling agent is instrumental in managing tasks and coordinating workload execution. Operating alongside a worker within the same pod, the agent uses a specialized algorithm to efficiently allocate tasks. It also manages database interactions, including data retrieval, storage, and the generation of new tasks. Additionally, the agent handles error management by retrying or resubmitting failed tasks. By confining both the scheduling agent and the worker to a single partition, the system maintains a well-structured and organized architecture, enhancing overall efficiency and reliability.
By abstracting and decoupling orchestration and storage interactions, the scheduling agent allows users to develop applications in ArmoniK without worrying about the underlying storage type. This flexibility enables seamless adaptation to different environments without modifying the worker or client code, except for changes to the partition or endpoint configuration.
The scheduling agent's role in managing these critical functions makes it an essential component in modern containerized architectures, ensuring robust performance and adaptability.
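Conceptually, each scheduling agent runs a loop like the one sketched below. This is only an illustration of the responsibilities described above (task acquisition, data handling, dispatch to the co-located worker, and retry on failure), not ArmoniK's actual implementation; `worker` and `store` are hypothetical duck-typed dependencies, and method names like `fetch` and `mark_failed` are invented for the sketch.

```python
import queue

MAX_RETRIES = 3

def agent_loop(task_queue: "queue.Queue", worker, store) -> None:
    """Illustrative agent loop: acquire, execute, persist, retry."""
    while True:
        task = task_queue.get()                 # acquire the next task
        inputs = store.fetch(task.input_keys)   # retrieve input data (hypothetical API)
        try:
            result = worker.execute(task, inputs)  # run on the co-located worker
            store.save(task.output_key, result)    # persist the result
        except Exception:
            task.retries += 1
            if task.retries <= MAX_RETRIES:
                task_queue.put(task)            # resubmit the failed task
            else:
                store.mark_failed(task)         # give up and record the error
        finally:
            task_queue.task_done()
```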
Having looked at how tasks are scheduled, let's now examine the system's responsiveness by testing its latency.
To evaluate ArmoniK's scheduling overhead in isolation, we used zero-work tasks: tasks that return a constant result without performing any computation. This approach eliminates processing noise and highlights the latency introduced by the control plane, queue management, and worker allocation systems. We simulated three submission patterns commonly seen in production. The first scenario submits tasks individually, representing fine-grained workloads. The second and third scenarios bundle tasks into batches of 10 and 100, respectively.
For each scenario, we measured submission time, processing time, and the overall end-to-end latency (from job submission to complete result retrieval). The results were visualized as cumulative distribution functions (CDFs), with the $x$-axis representing latency and the $y$-axis showing the fraction of tasks completed within that time. The CDFs clearly indicate that larger batches reduce per-task overhead, confirming that ArmoniK effectively amortizes fixed scheduling costs even under high throughput.
Figure 1: The CDF illustrates that individual tasks complete in about 40 ms on average (nearly all under 50 ms), 10-task batches average around 60 ms (most under 70 ms), and 100-task batches average approximately 140 ms (almost all under 150 ms)
In this figure, single-task submissions complete in about 40 ms on average, with nearly all tasks finishing below 50 ms. Ten-task batches average around 60 ms, with most tasks completing by 70 ms. Even with 100 tasks in a batch, the average end-to-end latency remains close to 140 ms, and almost all tasks finish under 150 ms. This shows that ArmoniK keeps overhead low across submission patterns while maintaining latency that is nearly undetectable to users: end-to-end latency stays well under a quarter of a second, even with larger batches, making the performance feel almost instantaneous.
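For reference, curves like those in Figure 1 are straightforward to derive from raw latency samples. The sketch below (standard-library Python, not our measurement code) shows the computation: sort the per-task latencies, and for each value report the fraction of tasks that completed within it.

```python
def ecdf(latencies_ms: list[float]) -> list[tuple[float, float]]:
    """Empirical CDF: (latency, fraction of tasks completed within it)."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]

# Toy samples loosely shaped like the single-task scenario in Figure 1.
samples = [38.2, 41.5, 39.9, 44.0, 47.3, 36.8]
for latency, fraction in ecdf(samples):
    print(f"{fraction:.0%} of tasks finished within {latency:.1f} ms")
```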
The latency analysis gives us insight into the system's responsiveness, setting the stage for our exploration of efficiency in handling short tasks.
ArmoniK demonstrates exceptional orchestration capabilities when processing short-duration workloads, a critical performance indicator for high-throughput computing environments. Our testing methodology utilized one-second tasks to evaluate system efficiency under realistic operational conditions. We systematically scaled from 1 to 2,000 workers executing standardized one-second tasks, closely monitoring worker occupation rates throughout each computational session. The data captured for a 2,000-worker session reveals ArmoniK's impressive performance characteristics.
With efficiency established, let's move on to the core of the system: its task distribution architecture, which is key to these impressive results.
ArmoniK's high efficiency stems from its optimized task scheduling architecture:
This architecture is the backbone of ArmoniK's performance. Next, we'll quantify these capabilities with specific performance metrics.
Figure 2: Worker average usage plot for 2000 workers
The worker usage plot reveals three distinct phases:
Figure 3: Task handler operation duration and task throughput plots during 1-second-task session
Figure 3 illustrates task processing metrics for one-second tasks, revealing a mean task execution time of 2.42 seconds, very close to the total execution time, which indicates minimal scheduling overhead. With 99.3% worker utilization, almost all compute time is dedicated to productive work, which translates to optimized resource use, reduced costs, and faster task completion.
Having reviewed these metrics, we now turn our focus to how efficiency and throughput work together to maximize performance.
The 99.3% worker utilization rate means that for every 1,000 seconds of allocated compute time, 993 seconds are spent directly on productive task execution. This remarkably high efficiency is achieved despite the inherent challenges of short-duration task processing, where even minimal scheduling overhead can significantly impact utilization rates.
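Formally, worker utilization is simply productive time over allocated time, which for this session gives:

$$\text{utilization} = \frac{t_{\text{productive}}}{t_{\text{allocated}}} = \frac{993\ \text{s}}{1000\ \text{s}} = 99.3\%$$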
The marginal efficiency gap of 0.7% results from the unavoidable latency between task acquisition and result transmission. Despite this constraint, ArmoniK's task distribution system demonstrates exceptional performance, effectively managing workloads with minimal impact on overall efficiency.
This level of efficiency translates directly to optimized resource utilization, reduced operational costs, and accelerated workload completion times for compute-intensive applications.
With efficiency and throughput clearly defined, our next topic is scalability-a key factor in today's high-demand environments.
Throughput is a critical metric that measures the rate at which a system processes tasks. It is particularly important in production environments, where high volumes of tasks must be handled efficiently to meet operational demands and keep computational resources fully utilized.
ArmoniK's throughput capabilities for zero-work tasks provide insights into the platform's scheduling efficiency. Zero-work tasks allow us to measure the orchestration overhead without the variability of actual computation time.
Our testing utilized 2,000 workers with 16 distributed clients submitting approximately 10 million zero-work tasks simultaneously. This configuration thoroughly tested the system's scheduling capabilities under significant load.
Figure 4: Task handler operation duration and task throughput plots during zero-work-task session
The performance data reveals three distinct phases in ArmoniK's task processing cycle:
Figure 5: Task throughput plot during the zero-work-task session
This performance translates to approximately 15 million tasks processed per hour or 360 million tasks in a 24-hour period. For enterprise environments with production computing needs, this means ArmoniK can process large workloads within production windows.
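These totals follow directly from the sustained per-second rate implied by the figures above (roughly 4,200 zero-work tasks per second):

$$4{,}200\ \text{tasks/s} \times 3{,}600\ \text{s/h} \approx 15\ \text{M tasks/h}, \qquad 15\ \text{M tasks/h} \times 24\ \text{h} = 360\ \text{M tasks/day}$$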
ArmoniK achieves this high throughput through pipelining: orchestration and computation overlap, with workers requesting the next task before completing the current one, minimizing idle time between tasks. This overlap is crucial for achieving high worker utilization, as demonstrated by the 99.3% utilization in the one-second task benchmarks.
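The sketch below illustrates the pipelining idea with a minimal double-buffered loop: a background thread fetches task $n+1$ while the worker processes task $n$. This is a conceptual stand-in written with Python's standard library, not ArmoniK's implementation, and the 50 ms / 100 ms durations are arbitrary simulated values.

```python
import queue
import threading
import time

def fetch_task(n: int) -> int:
    time.sleep(0.05)  # simulated orchestration latency (queue poll + data fetch)
    return n

def run_pipelined(num_tasks: int) -> None:
    """Overlap fetching task n+1 with processing task n (double buffering)."""
    buffer: "queue.Queue[int]" = queue.Queue(maxsize=1)

    def prefetcher() -> None:
        for n in range(num_tasks):
            buffer.put(fetch_task(n))  # runs ahead while the worker computes

    threading.Thread(target=prefetcher, daemon=True).start()
    for _ in range(num_tasks):
        buffer.get()       # usually ready immediately: the fetch already overlapped
        time.sleep(0.1)    # simulated 100 ms of useful computation

start = time.perf_counter()
run_pipelined(10)
print(f"elapsed: {time.perf_counter() - start:.2f} s")  # ~1.05 s vs ~1.50 s sequential
```

Because fetching hides behind computation, the pipeline's steady-state cost per task is just the compute time, which is exactly why worker utilization stays so close to 100%.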
While the results indicate effective handling of both scheduling-bound and computation-bound workloads, there are opportunities for performance optimization, particularly in task acquisition and post-processing phases where significant overhead was observed.
Figure 6: Task handler operation durations during the zero-work-task session
Scalability is a critical aspect of ArmoniK's performance, determining its ability to handle increasing workloads efficiently. Our analysis evaluates how well ArmoniK scales across different configurations by assessing its performance with varying numbers of workers and tasks. We focus on understanding both the strengths and the bottlenecks that emerge as the system scales. We cap this study at 2,000 workers for cost reasons (the experiments in this article cost roughly $13k); larger-scale benchmarks are planned for the future.
We evaluated scalability by systematically increasing the number of workers and measuring the resulting throughput. The workload consisted of tasks designed to simulate real-world computational demands, with each task requiring approximately one second to complete. Our findings demonstrate that throughput improves significantly with more workers, nearly doubling with each doubling of resources. However, as the worker count increases further, the rate of improvement slows, indicating diminishing returns due to resource contention or system overhead.
ArmoniK exhibits near-linear scalability for independent tasks, efficiently distributing them with minimal coordination overhead. However, for workloads with task dependencies, scalability is slightly less efficient due to the additional coordination required. Nevertheless, throughput still improves with additional workers, and enhancements in dependency management could further boost performance.
At very high worker counts, inefficiencies such as scheduling overhead and communication latency become noticeable, impacting overall performance. Despite these challenges, ArmoniK demonstrates robust scalability, particularly for independent tasks, with throughput closely following resource increases.
Below is an example of a scalability session using 1-second tasks:
| #Workers | #Tasks | Throughput (tasks/s) |
|---|---|---|
| 1 | 200 | 1 |
| 120 | 50 k | 119 |
| 250 | 100 k | 248 |
| 500 | 100 k | 496 |
| 1000 | 200 k | 992 |
| 2000 | 400 k | 1980 |
Figure 7: Scalability plot for 1-second tasks from 1 to 2000 workers
This table illustrates how ArmoniK's throughput scales with the number of workers, providing a clear view of its performance capabilities under varying workloads. The analysis highlights ArmoniK's effective scalability while identifying areas for optimization, particularly in managing task dependencies.
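A useful way to read the table is through scaling efficiency: achieved throughput divided by worker count. Since each worker can complete at most one 1-second task per second, ideal throughput equals the worker count. A quick computation over the table's values:

```python
# (workers, measured throughput in tasks/s) from the table above
runs = [(1, 1), (120, 119), (250, 248), (500, 496), (1000, 992), (2000, 1980)]

for workers, throughput in runs:
    efficiency = throughput / workers   # ideal = 1 task/s per worker
    print(f"{workers:>5} workers: {efficiency:.1%} of ideal throughput")
# Efficiency stays at or above 99% all the way up to 2,000 workers.
```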
With scalability established, one question remains before we turn to how ArmoniK leverages AWS: why stop at 2,000 workers?
Runs beyond 2,000 workers were intentionally deferred in this article to maintain reliability, reproducibility, and cost-efficiency. The main reasons are:
What's next (in a follow-up article):
ArmoniK integrates seamlessly with AWS to provide a scalable, cost-effective orchestration platform. Running on Amazon EKS, the platform benefits from a managed Kubernetes environment that simplifies deployments by automating node provisioning, updates, and security configurations. This ensures high availability across multiple Availability Zones and minimizes the operational burden on deployment teams.
The solution leverages flexible Terraform configurations to support both on-demand and spot instances, allowing it to dynamically allocate resources based on workload demands. As compute resources are doubled, throughput scales nearly proportionally, which confirms the system's efficiency in handling increasing workloads while keeping costs under control.
In addition to the core compute infrastructure, ArmoniK utilizes several key AWS managed services to streamline operations. Amazon S3 provides reliable, scalable object storage for task inputs and outputs. For messaging, Amazon MQ and Amazon SQS ensure that task orchestration remains robust and responsive even under high loads.
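The division of labor between these services follows a common cloud pattern: large payloads live in S3 while small pointer messages flow through the queue. The boto3 snippet below is only an illustration of that pattern, not how ArmoniK's storage adaptors are implemented; the bucket name and queue URL are placeholders.

```python
import json
import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "my-armonik-payloads"  # placeholder bucket name
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-task-queue"  # placeholder

def enqueue_task(task_id: str, payload: bytes) -> None:
    # Store the (potentially large) task input in S3...
    s3.put_object(Bucket=BUCKET, Key=f"inputs/{task_id}", Body=payload)
    # ...and send only a small pointer message through the queue.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"task_id": task_id, "input_key": f"inputs/{task_id}"}),
    )
```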
Overall, this deep integration with AWS not only enhances performance and scalability but also provides the reliability and security necessary for production workloads. This makes ArmoniK an ideal solution for organizations looking to efficiently scale their high-performance computing environments in the cloud.
ArmoniK on AWS stands out as a powerful solution for high-performance and high-throughput computing. Its ability to dynamically distribute task graphs, support multiple programming languages, and provide robust fault tolerance ensures that it can efficiently handle complex computational workloads. With benchmarks showing up to 99.3% worker utilization and throughput exceeding 4,000 tasks per second, ArmoniK delivers exceptional performance while optimizing resource usage. Moreover, its deep integration with AWS services makes it a cost-effective, reliable choice for organizations looking to scale their computing environments in the cloud.
In short, this deep dive shows that ArmoniK is not just a technical marvel: it's a practical, scalable solution ready to meet modern computational demands.