In today's dynamic computing landscape, organizations not only demand immense processing power but also require seamless orchestration of distributed resources. While cloud platforms provide the essential hardware, the true challenge lies in managing distributed workloads: coordinating task scheduling, ensuring efficient data transfers, and maintaining robust fault tolerance across diverse infrastructures.
ArmoniK meets these challenges head-on by abstracting the underlying complexities of distributed computing. By automating critical processes such as task scheduling, resource allocation, and failure recovery, it streamlines both the development and deployment of applications. This approach allows developers to concentrate on core functionality and innovation, ultimately leading to optimized performance and scalability in modern computing environments.
In this article, we explore ArmoniK's performance and scalability, particularly its integration with AWS. We'll cover key features like task graph distribution, language flexibility, fault tolerance, and elastic scaling. Our benchmarking will showcase ArmoniK's efficient resource utilization, achieving high Worker utilization and throughput under significant loads, including tests with up to 2,000 task handlers processing 10 million tasks.
Additionally, we'll discuss how ArmoniK's integration with AWS services enhances performance, scalability, and reliability. By the end, you'll understand ArmoniK's capabilities and its potential to optimize high-performance computing in the cloud.
With the challenges clearly outlined and ArmoniK introduced as the solution, let's explore what makes this platform tick.
ArmoniK is an open-source platform designed to address the growing demands of high-performance and high-throughput computing across heterogeneous infrastructures [1]. As a computation orchestrator, ArmoniK simplifies the development and deployment of distributed applications by abstracting away the complexities of task scheduling, data management, and fault tolerance [2].
At its core, ArmoniK allows developers to focus on creating their applications without worrying about the underlying distributed execution details. The platform automatically handles task distribution, resource allocation, and failure recovery, ensuring reliable performance even in dynamic environments.
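This programming model is easiest to see in code. The sketch below is not the ArmoniK SDK; it uses Python's standard `concurrent.futures` as a local stand-in to illustrate the idea the platform generalizes to a cluster: the developer writes plain task logic, and an executor owns scheduling, distribution, and retries.

```python
from concurrent.futures import ProcessPoolExecutor

def price_option(spot: float, strike: float) -> float:
    """Plain application logic: no scheduling, storage, or retry code."""
    return max(spot - strike, 0.0)

if __name__ == "__main__":
    # A local stand-in for what ArmoniK does at cluster scale:
    # the executor decides where and when each task runs.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(price_option, [100.0, 105.0, 98.0], [100.0] * 3))
    print(results)  # [0.0, 5.0, 0.0]
```

With ArmoniK, the same separation holds, except that the "pool" spans many machines and the platform also persists inputs and outputs and recovers from worker failures.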
Now that we have an overview of its purpose, let's dive into the specific features that empower ArmoniK.
By handling these functions, ArmoniK empowers developers to concentrate on application logic while the platform manages the complexities of distributed computing, including resource allocation and data management.
These features illustrate the versatility of ArmoniK. Up next, we'll clarify some of the technical terms used throughout the article.
Now that we're on the same page with the terminology, let's see how the performance of ArmoniK is rigorously evaluated.
We developed a comprehensive testing framework that simulates real-world computational workloads. Our evaluation focuses on three performance indicators:
The initial configuration uses 120 workers (each with 1 vCPU and 1 GiB of RAM) and 16 submitting clients. Workloads are adjusted dynamically to simulate scenarios from 1 million to 10 million tasks. By varying one parameter at a time, either the workload size or the number of workers, we isolate and analyze the impact on overall performance. This ensures a controlled distribution of computational load while maintaining a realistic execution environment.
The testing strategy begins with setting up the environment and allocating the required workers. Once the system is configured, the benchmark is initiated by launching parallel jobs, starting with 1 million tasks evenly distributed among the 16 clients. As the test progresses, the workload is gradually increased by modifying the number of tasks assigned per client while scaling the number of workers accordingly. The evaluation encompasses the following scenarios, designed to systematically analyze the impact of varying workloads and worker counts:
| Number of workers | Number of tasks (millions) |
|---|---|
| 120 | 1 |
| 250 | 3 |
| 500 | 5 |
| 1000 | 7 |
| 2000 | 10 |
To ensure precise analysis, if system limitations arise, only one variable, either the workload or the number of workers, is increased at a time. This approach allows for the isolated examination of each variable's impact on system performance.
Throughout the benchmarking process, performance metrics are continuously recorded, including throughput and resource utilization. The collected data is then analyzed to assess how efficiently ArmoniK handles varying workloads and scales under increasing demand. This methodology provides a structured and comprehensive evaluation of ArmoniK's capabilities across different computational scenarios.
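To make the methodology concrete, here is a minimal sketch of the driver logic just described. It is not our actual benchmark harness: task execution is a placeholder and the 16 clients are threads rather than separate machines, but the structure (even partitioning across clients, parallel submission, wall-clock throughput measurement) is the same.

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_CLIENTS = 16

def run_client(num_tasks: int) -> int:
    # Stand-in for one client submitting its share of tasks
    # and waiting for the results to come back.
    for _ in range(num_tasks):
        pass  # placeholder for submit + wait
    return num_tasks

def run_benchmark(total_tasks: int) -> float:
    per_client = total_tasks // NUM_CLIENTS   # even split among the clients
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=NUM_CLIENTS) as clients:
        completed = sum(clients.map(run_client, [per_client] * NUM_CLIENTS))
    return completed / (time.perf_counter() - start)  # tasks per second

print(f"{run_benchmark(1_000_000):,.0f} tasks/s")
```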
This methodology lays the groundwork for our next discussion: understanding the types of tasks used to simulate different conditions.
To represent diverse real-world conditions, we designed task profiles that include:
With these profiles established, it's time to explore how the system manages and schedules these tasks.
In containerized environments, ArmoniK's scheduling agent is instrumental in managing tasks and coordinating workload execution. Operating alongside a worker within the same pod, the agent uses a specialized algorithm to efficiently allocate tasks. It also manages database interactions, including data retrieval, storage, and the generation of new tasks. Additionally, the agent handles error management by retrying or resubmitting failed tasks. By confining both the scheduling agent and the worker to a single partition, the system maintains a well-structured and organized architecture, enhancing overall efficiency and reliability.
By abstracting and decoupling orchestration and storage interactions, the scheduling agent allows users to develop applications in ArmoniK without worrying about the underlying storage type. This flexibility enables seamless adaptation to different environments without modifying the worker or client code, except for changes to the partition or endpoint configuration.
The scheduling agent's role in managing these critical functions makes it an essential component in modern containerized architectures, ensuring robust performance and adaptability.
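Conceptually, each scheduling agent runs a loop like the one sketched below. This is only an illustration of the responsibilities described above (task acquisition, data handling, dispatch to the co-located worker, and retry on failure), not ArmoniK's actual implementation; `worker` and `store` are hypothetical duck-typed dependencies, and method names like `fetch` and `mark_failed` are invented for the sketch.

```python
import queue

MAX_RETRIES = 3

def agent_loop(task_queue: "queue.Queue", worker, store) -> None:
    """Illustrative agent loop: acquire, execute, persist, retry."""
    while True:
        task = task_queue.get()                 # acquire the next task
        inputs = store.fetch(task.input_keys)   # retrieve input data (hypothetical API)
        try:
            result = worker.execute(task, inputs)  # run on the co-located worker
            store.save(task.output_key, result)    # persist the result
        except Exception:
            task.retries += 1
            if task.retries <= MAX_RETRIES:
                task_queue.put(task)            # resubmit the failed task
            else:
                store.mark_failed(task)         # give up and record the error
        finally:
            task_queue.task_done()
```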
Having looked at how tasks are scheduled, let's now examine the system's responsiveness by testing its latency.
To evaluate ArmoniK's scheduling overhead in isolation, we used zero-work tasks: tasks that return a constant result without performing any computation. This approach eliminates processing noise and highlights the latency introduced by the control plane, queue management, and worker allocation systems. We simulated three submission patterns commonly seen in production. The first scenario submits tasks individually, representing fine-grained workloads. The second and third scenarios bundle tasks into batches of 10 and 100, respectively.
For each scenario, we measured submission time, processing time, and the overall end-to-end latency (from job submission to complete result retrieval). The results were visualized as cumulative distribution functions (CDFs), with the $x$-axis representing latency and the $y$-axis showing the fraction of tasks completed within that time. The CDFs clearly indicate that larger batches reduce per-task overhead, confirming that ArmoniK effectively amortizes fixed scheduling costs even under high throughput.
Figure 1: The CDF illustrates that individual tasks complete in about 40 ms on average (nearly all under 50 ms), 10-task batches average around 60 ms (most under 70 ms), and 100-task batches average approximately 140 ms (almost all under 150 ms)
In this figure, single-task submissions complete in about 40 ms on average, with nearly all tasks finishing below 50 ms. Ten-task batches average around 60 ms, with most tasks completing by 70 ms. Even with 100 tasks in a batch, the average end-to-end latency remains close to 140 ms, and almost all tasks finish under 150 ms. This shows that ArmoniK keeps overhead low across submission patterns while maintaining latency that is nearly undetectable to users: end-to-end latency stays well under a quarter of a second, even with larger batches, making the performance feel almost instantaneous.
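For reference, curves like those in Figure 1 are straightforward to derive from raw latency samples. The sketch below (standard-library Python, not our measurement code) shows the computation: sort the per-task latencies, and for each value report the fraction of tasks that completed within it.

```python
def ecdf(latencies_ms: list[float]) -> list[tuple[float, float]]:
    """Empirical CDF: (latency, fraction of tasks completed within it)."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]

# Toy samples loosely shaped like the single-task scenario in Figure 1.
samples = [38.2, 41.5, 39.9, 44.0, 47.3, 36.8]
for latency, fraction in ecdf(samples):
    print(f"{fraction:.0%} of tasks finished within {latency:.1f} ms")
```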
The latency analysis gives us insight into the system's responsiveness, setting the stage for our exploration of efficiency in handling short tasks.
ArmoniK demonstrates exceptional orchestration capabilities when processing short-duration workloads, a critical performance indicator for high-throughput computing environments. Our testing methodology utilized one-second tasks to evaluate system efficiency under realistic operational conditions. We systematically scaled from 1 to 2,000 workers executing standardized one-second tasks, closely monitoring worker occupation rates throughout each computational session. The data captured for a 2,000-worker session reveals ArmoniK's impressive performance characteristics.
With efficiency established, let's move on to the core of the system: its task distribution architecture, which is key to these impressive results.
ArmoniK's high efficiency stems from its optimized task scheduling architecture:
This architecture is the backbone of ArmoniK's performance. Next, we'll quantify these capabilities with specific performance metrics.
Figure 2: Worker average usage plot for 2000 workers
The worker usage plot reveals three distinct phases:
Figure 3: Task handler operation duration and task throughput plots during 1-second-task session
Figure 3 illustrates task processing metrics for one-second tasks, revealing a mean task execution time of 2.42 seconds, very close to the total execution time, which indicates minimal scheduling overhead. With 99.3% worker utilization, almost all compute time is dedicated to productive work, which translates to optimized resource use, reduced costs, and faster task completion.
Having reviewed these metrics, we now turn our focus to how efficiency and throughput work together to maximize performance.
The 99.3% worker utilization rate means that for every 1,000 seconds of allocated compute time, 993 seconds are spent directly on productive task execution. This remarkably high efficiency is achieved despite the inherent challenges of short-duration task processing, where even minimal scheduling overhead can significantly impact utilization rates.
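Formally, worker utilization is simply productive time over allocated time, which for this session gives:

$$\text{utilization} = \frac{t_{\text{productive}}}{t_{\text{allocated}}} = \frac{993\ \text{s}}{1000\ \text{s}} = 99.3\%$$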
The marginal efficiency gap of 0.7% results from the unavoidable latency between task acquisition and result transmission. Despite this constraint, ArmoniK's task distribution system demonstrates exceptional performance, effectively managing workloads with minimal impact on overall efficiency.
This level of efficiency translates directly to optimized resource utilization, reduced operational costs, and accelerated workload completion times for compute-intensive applications.
With efficiency and throughput clearly defined, our next topic is scalability-a key factor in today's high-demand environments.
Throughput is a critical metric that measures the rate at which a system processes tasks. It is particularly important in production environments, where high volumes of tasks must be handled efficiently to meet operational demands and keep computational resources fully utilized.
ArmoniK's throughput capabilities for zero-work tasks provide insights into the platform's scheduling efficiency. Zero-work tasks allow us to measure the orchestration overhead without the variability of actual computation time.
Our testing utilized 2,000 workers with 16 distributed clients submitting approximately 10 million zero-work tasks simultaneously. This configuration thoroughly tested the system's scheduling capabilities under significant load.
Figure 4: Task handler operation duration and task throughput plots during zero-work-task session
The performance data reveals three distinct phases in ArmoniK's task processing cycle:
Figure 5: Task throughput plot during the zero-work-task session
This performance translates to approximately 15 million tasks processed per hour or 360 million tasks in a 24-hour period. For enterprise environments with production computing needs, this means ArmoniK can process large workloads within production windows.
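These totals follow directly from the sustained per-second rate implied by the figures above (roughly 4,200 zero-work tasks per second):

$$4{,}200\ \text{tasks/s} \times 3{,}600\ \text{s/h} \approx 15\ \text{M tasks/h}, \qquad 15\ \text{M tasks/h} \times 24\ \text{h} = 360\ \text{M tasks/day}$$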
ArmoniK achieves this high throughput through pipelining: orchestration and computation overlap, with workers requesting the next task before completing the current one, minimizing idle time between tasks. This overlap is crucial for achieving high worker utilization, as demonstrated by the 99.3% utilization in the one-second task benchmarks.
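The sketch below illustrates the pipelining idea with a minimal double-buffered loop: a background thread fetches task $n+1$ while the worker processes task $n$. This is a conceptual stand-in written with Python's standard library, not ArmoniK's implementation, and the 50 ms / 100 ms durations are arbitrary simulated values.

```python
import queue
import threading
import time

def fetch_task(n: int) -> int:
    time.sleep(0.05)  # simulated orchestration latency (queue poll + data fetch)
    return n

def run_pipelined(num_tasks: int) -> None:
    """Overlap fetching task n+1 with processing task n (double buffering)."""
    buffer: "queue.Queue[int]" = queue.Queue(maxsize=1)

    def prefetcher() -> None:
        for n in range(num_tasks):
            buffer.put(fetch_task(n))  # runs ahead while the worker computes

    threading.Thread(target=prefetcher, daemon=True).start()
    for _ in range(num_tasks):
        buffer.get()       # usually ready immediately: the fetch already overlapped
        time.sleep(0.1)    # simulated 100 ms of useful computation

start = time.perf_counter()
run_pipelined(10)
print(f"elapsed: {time.perf_counter() - start:.2f} s")  # ~1.05 s vs ~1.50 s sequential
```

Because fetching hides behind computation, the pipeline's steady-state cost per task is just the compute time, which is exactly why worker utilization stays so close to 100%.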
While the results indicate effective handling of both scheduling-bound and computation-bound workloads, there are opportunities for performance optimization, particularly in task acquisition and post-processing phases where significant overhead was observed.
Figure 6: Task handler operation durations during the zero-work-task session
Scalability is a critical aspect of ArmoniK's performance, determining its ability to handle increasing workloads efficiently. Our analysis evaluates how well ArmoniK scales across different configurations by assessing its performance with varying numbers of workers and tasks. We focus on understanding both the strengths and the bottlenecks that emerge as the system scales. We cap this study at 2,000 workers for cost reasons (the experiments in this article cost roughly $13k); larger-scale benchmarks are planned for the future.
We evaluated scalability by systematically increasing the number of workers and measuring the resulting throughput. The workload consisted of tasks designed to simulate real-world computational demands, with each task requiring approximately one second to complete. Our findings demonstrate that throughput improves significantly with more workers, nearly doubling with each doubling of resources. However, as the worker count increases further, the rate of improvement slows, indicating diminishing returns due to resource contention or system overhead.
ArmoniK exhibits near-linear scalability for independent tasks, efficiently distributing them with minimal coordination overhead. However, for workloads with task dependencies, scalability is slightly less efficient due to the additional coordination required. Nevertheless, throughput still improves with additional workers, and enhancements in dependency management could further boost performance.
At very high worker counts, inefficiencies such as scheduling overhead and communication latency become noticeable, impacting overall performance. Despite these challenges, ArmoniK demonstrates robust scalability, particularly for independent tasks, with throughput closely following resource increases.
Below is an example of a scalability session using 1-second tasks:
| #Workers | #Tasks | Throughput (tasks/s) |
|---|---|---|
| 1 | 200 | 1 |
| 120 | 50 k | 119 |
| 250 | 100 k | 248 |
| 500 | 100 k | 496 |
| 1000 | 200 k | 992 |
| 2000 | 400 k | 1980 |
Figure 7: Scalability plot for 1-second tasks from 1 to 2000 workers
This table illustrates how ArmoniK's throughput scales with the number of workers, providing a clear view of its performance capabilities under varying workloads. The analysis highlights ArmoniK's effective scalability while identifying areas for optimization, particularly in managing task dependencies.
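A useful way to read the table is through scaling efficiency: achieved throughput divided by worker count. Since each worker can complete at most one 1-second task per second, ideal throughput equals the worker count. A quick computation over the table's values:

```python
# (workers, measured throughput in tasks/s) from the table above
runs = [(1, 1), (120, 119), (250, 248), (500, 496), (1000, 992), (2000, 1980)]

for workers, throughput in runs:
    efficiency = throughput / workers   # ideal = 1 task/s per worker
    print(f"{workers:>5} workers: {efficiency:.1%} of ideal throughput")
# Efficiency stays at or above 99% all the way up to 2,000 workers.
```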
With scalability established, one question remains before we turn to how ArmoniK leverages AWS: why stop at 2,000 workers?
Runs beyond 2,000 workers were intentionally deferred in this article to maintain reliability, reproducibility, and cost-efficiency. The main reasons are:
What's next (in a follow-up article):
ArmoniK integrates seamlessly with AWS to provide a scalable, cost-effective orchestration platform. Running on Amazon EKS, the platform benefits from a managed Kubernetes environment that simplifies deployments by automating node provisioning, updates, and security configurations. This ensures high availability across multiple Availability Zones and minimizes the operational burden on deployment teams.
The solution leverages flexible Terraform configurations to support both on-demand and spot instances, allowing it to dynamically allocate resources based on workload demands. As compute resources are doubled, throughput scales nearly proportionally, which confirms the system's efficiency in handling increasing workloads while keeping costs under control.
In addition to the core compute infrastructure, ArmoniK utilizes several key AWS managed services to streamline operations. Amazon S3 provides reliable, scalable object storage for task inputs and outputs. For messaging, Amazon MQ and Amazon SQS ensure that task orchestration remains robust and responsive even under high loads.
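The division of labor between these services follows a common cloud pattern: large payloads live in S3 while small pointer messages flow through the queue. The boto3 snippet below is only an illustration of that pattern, not how ArmoniK's storage adaptors are implemented; the bucket name and queue URL are placeholders.

```python
import json
import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "my-armonik-payloads"  # placeholder bucket name
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-task-queue"  # placeholder

def enqueue_task(task_id: str, payload: bytes) -> None:
    # Store the (potentially large) task input in S3...
    s3.put_object(Bucket=BUCKET, Key=f"inputs/{task_id}", Body=payload)
    # ...and send only a small pointer message through the queue.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"task_id": task_id, "input_key": f"inputs/{task_id}"}),
    )
```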
Overall, this deep integration with AWS not only enhances performance and scalability but also provides the reliability and security necessary for production workloads. This makes ArmoniK an ideal solution for organizations looking to efficiently scale their high-performance computing environments in the cloud.
ArmoniK on AWS stands out as a powerful solution for high-performance and high-throughput computing. Its ability to dynamically distribute task graphs, support multiple programming languages, and provide robust fault tolerance ensures that it can efficiently handle complex computational workloads. With benchmarks showing up to 99.3% worker utilization and throughput exceeding 4,000 tasks per second, ArmoniK delivers exceptional performance while optimizing resource usage. Moreover, its deep integration with AWS services makes it a cost-effective, reliable choice for organizations looking to scale their computing environments in the cloud.
In short, this deep dive shows that ArmoniK is not just a technical marvel: it's a practical, scalable solution ready to meet modern computational demands.