ArmoniK First User Group - June 2026
Detailed report of the day
Organized by Aneo. Morning presentation by Wilfried Kirschenmann and Jérôme Gurhem. Present: CACIB and Natixis (clients), Intesa (prospect) and CentraleSupélec (research), as well as a French hedge fund. Other organizations in POC or collaboration (EDF, a car manufacturer, other European banks) were mentioned during the presentation without being represented in the room.
Agenda of the day
- 9:00-10:00 - Welcome / Coffee
- 10:00-12:30 - Morning session: overview of the latest release, product roadmap, answers to the questions submitted at registration
- 12:30-14:00 - Lunch
- 14:00-17:00 - Afternoon session: client testimonials, break, interactive workshops / working groups on monitoring and metrics
This report covers all of these: Aneo's morning presentation, the two client testimonials (Natixis and CACIB), and the write-ups of the two afternoon workshops.
1. Morning session - Aneo presentation
1.1 The vision: "the HPC platform for the most demanding applications"
The presentation started from a historical observation: three worlds of computing, long distinct, independently converged on the same abstraction.
The finance world (FSI). It began simply, with embarrassingly parallel Monte Carlo simulations run on HTC schedulers: because each task was independent, fault tolerance and malleability came "for free". Then quantitative algorithms grew more complex: XVA/CVA adjustments require portfolio-level aggregation, and Adjoint Algorithmic Differentiation (AAD) traverses the computation graph backward. These workloads are no longer trivially parallel: they involve deep, nested dependencies. Custom orchestration layers were grafted onto HTC schedulers to absorb this complexity, but the properties once free (fault tolerance, malleability, cloud portability) became hard to preserve.
The HPC world. It gathers supercomputer workloads with complex node-to-node communication, historically carried by MPI. MPI is powerful but unforgiving: communication patterns are hard to design and debug, a single node failure often forces a full restart, and malleability (adjusting the number of nodes at runtime) is nearly impossible. The HPC community responded by building task-based runtimes (StarPU, PaRSEC, Legion, OmpSs, Charm++) that decompose computation into tasks with explicit dependencies.
The cloud world. The cloud offers a compelling model (elasticity, near-unlimited resources, pay-per-use, spot instances that cut costs by 60 to 90%) but brings new complexities: malleability and fault tolerance (spots can be reclaimed at any time), data movement and lifecycle, multi-tenancy and security, latency, portability and cost. A security point was highlighted: legacy schedulers typically require inbound connections from the cloud to on-prem, which security teams reject; ArmoniK was designed with outbound-only connections from on-prem to the cloud, which removes that blocker.
Three worlds, one abstraction. The three worlds converged on the task graph, which decouples the algorithm's structure from its execution (re-running failed nodes, scheduling over variable resources, portability). ArmoniK goes further with a bipartite graph where both tasks and data are explicit, enabling data-aware scheduling, explicit data lifecycle management, and computation/communication overlap.
Dynamic graphs. A "Solve" task can, at runtime, inspect its input data and decide to decompose the problem by submitting a whole subgraph (new data, parallel Solve tasks, intermediate results, an aggregation step), then delegate the production of its result. It is the compute library (neither the orchestrator nor the user) that decides dynamically. Real FSI case: a pricing library receives a portfolio, splits it by counterparty, runs XVA adjustments in parallel, then aggregates; the splitting strategy depends on the portfolio contents and the algorithms. It cannot be known at submission time.
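The decomposition pattern described above can be illustrated with a toy, in-process model. This is a minimal sketch and not ArmoniK SDK code: `solve`, `xva_for_counterparty`, the trade layout, and the 1% adjustment are hypothetical, and the "subgraph" is simply a list of sub-computations run locally.

```python
from collections import defaultdict

# Hypothetical trade record: (counterparty, exposure). Not an ArmoniK type.
Trade = tuple[str, float]

def xva_for_counterparty(trades: list[Trade]) -> float:
    """Stand-in for a per-counterparty adjustment (here: 1% of the exposure)."""
    return 0.01 * sum(exposure for _, exposure in trades)

def solve(portfolio: list[Trade], split_threshold: int = 2) -> float:
    """Decide at runtime, from the input data, whether to split the problem."""
    if len(portfolio) <= split_threshold:
        # Small enough: compute directly, no subgraph is created.
        return xva_for_counterparty(portfolio)
    # Otherwise build a subgraph: one sub-computation per counterparty...
    by_counterparty: dict[str, list[Trade]] = defaultdict(list)
    for trade in portfolio:
        by_counterparty[trade[0]].append(trade)
    partial_results = [xva_for_counterparty(t) for t in by_counterparty.values()]
    # ...plus an aggregation step that produces the result of the parent task.
    return sum(partial_results)

portfolio = [("A", 100.0), ("B", 200.0), ("A", 50.0), ("C", 150.0)]
total = solve(portfolio)
```

The point of the sketch is that the number of sub-computations depends on the portfolio contents, so it cannot be fixed when the parent task is submitted.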
ArmoniK's positioning. ArmoniK is a production-ready task orchestrator, built on a bipartite graph and intended for critical environments, which makes it easier to develop, observe, and operate distributed algorithms. It rests on four pillars: industry-grade scalability (proven in finance), performance techniques inherited from HPC (computation/communication overlap, pipelining), a developer-centric and observable design, and a dynamic bipartite graph. It is released as open source and available on GitHub. The vision also details the target capabilities: ease of use and deployment, fault tolerance, observability, malleability, portability, data awareness, performance, and smart scheduling (priority, preemption, placement on heterogeneous resources).
In short, the vision frames ArmoniK as the culmination of a convergence: taking the abstraction common to the three worlds (the task graph) and enriching it with a bipartite, dynamic model designed for critical environments.
1.2 ArmoniK today: a mature product
This part gives the current state of the product: ongoing adoptions, technical ecosystem, proven capabilities, and the highlights of the latest production release.
On the finance side, several CACIB teams are already in production, Natixis is in the process of going to production, and a solid pipeline of POCs brings together hedge funds and other European banks. On the industry and research side, CentraleSupélec is running an internship on a seismic-imaging pipeline built on ArmoniK, EDF wants to experiment with a data-processing pipeline for physics simulations, and a POC is under way with a major car manufacturer.
Ecosystem and deployment. The ecosystem includes interoperable SDKs (C#, C++, Java), a language-specific Python SDK (Pymonik), APIs in C#, C++, Java, Rust, JavaScript and Python, as well as gRPC for any language. Adapters cover object storage (S3, MinIO, GCS), queues (SQS, PubSub, NATS, RabbitMQ, ActiveMQ) and the database (Percona MongoDB). Deployment runs on Kubernetes via Helm and Terraform (AWS, GCP, on-prem Kubernetes) or in local development mode. The observability stack combines Grafana dashboards, Prometheus metrics, CLEF structured logs, OpenTelemetry tracing, and an admin GUI (sessions, tasks, results, partitions).
Programming model. The model is based on a bipartite graph whose nodes are tasks and data. A task node is a single-node computation (possibly multi-threaded) taking one or more data items as input and producing one or more data items. A data node is an immutable data item depending on at most one task. Dependencies between tasks are expressed as data dependencies. The graph is potentially dynamic: tasks can add tasks and replace edges with subgraphs. The C# SDK has been redesigned: a Blob abstraction for the data lifecycle, explicit task-graph definition, and a modern asynchronous worker model. The C++ SDK gains parallelism, with upload, download and wait operations batched and parallelized through a thread pool.
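As an illustration of the bipartite model, the following minimal sketch stores tasks and data as two kinds of nodes and derives task-to-task dependencies from data dependencies. The class and method names are hypothetical and do not come from any ArmoniK SDK.

```python
from dataclasses import dataclass, field

@dataclass
class BipartiteGraph:
    """Toy model: tasks consume and produce data items."""
    producer: dict[str, str] = field(default_factory=dict)      # data -> producing task
    inputs: dict[str, list[str]] = field(default_factory=dict)  # task -> input data

    def add_task(self, task: str, inputs: list[str], outputs: list[str]) -> None:
        for data in outputs:
            # A data node depends on at most one task.
            if data in self.producer:
                raise ValueError(f"{data} is already produced by {self.producer[data]}")
        for data in outputs:
            self.producer[data] = task
        self.inputs[task] = list(inputs)

    def task_dependencies(self, task: str) -> set[str]:
        """Task-to-task dependencies are derived from data dependencies."""
        return {self.producer[d] for d in self.inputs[task] if d in self.producer}

graph = BipartiteGraph()
graph.add_task("price_a", inputs=["portfolio_a"], outputs=["result_a"])
graph.add_task("price_b", inputs=["portfolio_b"], outputs=["result_b"])
graph.add_task("aggregate", inputs=["result_a", "result_b"], outputs=["total"])
```

Here `aggregate` depends on both pricing tasks only because it consumes the data they produce. Data items with no producer, such as `portfolio_a`, are inputs supplied from outside the graph.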
SDKs. ArmoniK provides two families of SDK. The first, the interoperable SDKs (C#, C++, Java), ensures cross-language compatibility following a shared convention: they ease the development of clients and workers, provide pre-built worker containers, a high-level API, similar clients to define graphs and submit tasks, and conventions for error handling, idempotency and observability. The second, the language-specific SDKs, includes Pymonik (Python), in early access, targeting data science / ML workloads. The C++ and Java SDKs are now officially supported.
Highlighted qualities. Several qualities structure the product. ArmoniK is observable, through Prometheus metrics, Grafana dashboards, logs centralized with Fluent Bit, and an admin GUI. It is fault-tolerant: automatic recovery, idempotent operations ensured by a state machine, failure isolation without cascade, graceful degradation, and configurable grace periods. It is portable across cloud, on-prem, bare-metal Windows, Slurm supercomputers, and local development. It offers multi-cluster federation, today via a round-robin load balancer, with metric-aware scheduling and a global scheduler to come (under development). It aims to be easy to use and deploy, from local K3s to vanilla Kubernetes, through multi-cloud AWS/GCP, Docker Compose for Windows, and Slurm submission. Finally, it is malleable, with scaling adapting to each environment, and performant.
Performance. The figures cited report a sustained throughput of about 4,500 tasks/s on 2,000 workers, about 20 ms of scheduling overhead, and validation at more than 15,000 concurrent workers. On the efficiency side, ArmoniK relies on parallel transfers (prefetching, pipelining), a configurable streaming chunk size, a node-local data cache, and a decoupled architecture with stateless components.
Security & compliance. The setup includes an SBOM of the orchestration components, CVE scanning of images and dependencies, distroless images to reduce the attack surface, mTLS for all inter-service communication, encryption at rest and in transit, and auditability via structured logs and traceability. In progress: extending the SBOM, CVE scanning, and distroless images to the other components.
Open source. All components are public on github.com/aneoconsulting, under the AGPL 3.0 or Apache 2.0 license, with an active user group, a public issue tracker, and regular releases.
New production release - ArmoniK v2.23.0. Two cadences are now distinguished: the rolling release (every 4 to 6 weeks, with all the newest features and CVE fixes in the following release) and the production release (LTS, every 6 months, 9 months of support per version, patches every 4 to 6 weeks), providing stability, predictable upgrade cycles, and long-term security support. The v2.23.0 improvements fall into two areas:
- Performance: node-local filesystem cache (configurable eviction threshold); configurable streaming chunks (fixed 84 KB replaced by a 2 MB server-side default, fewer gRPC messages); C++ SDK parallelism; migration of all orchestrator components from .NET 8 to .NET 10 LTS (no impact on clients/workers running an older runtime).
- Developer experience & ecosystem: redesigned C# SDK and official support for the C#/C++/Java SDKs; declarative initialization (partitions, roles, authentication configured via structured environment variables, applied idempotently at startup, with no manual database intervention); ArmoniK Load Balancer now natively deployable on Kubernetes (TLS/mTLS, ingress, GUI integration) and integrated into the reference deployments; new NATS JetStream queue adapter (durable persistence); distroless images for all orchestrator components.
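The effect of the streaming chunk change listed under Performance can be estimated with simple arithmetic. The sketch below assumes that "KB" and "MB" are binary units and uses a hypothetical 100 MiB payload; it counts payload chunks only and ignores any protocol framing.

```python
def message_count(payload_bytes: int, chunk_bytes: int) -> int:
    """Number of stream messages needed to carry one payload (ceiling division)."""
    return (payload_bytes + chunk_bytes - 1) // chunk_bytes

KIB = 1024                     # assumption: units are binary (KiB / MiB)
payload = 100 * KIB * KIB      # hypothetical 100 MiB data item

old_messages = message_count(payload, 84 * KIB)        # 1220 messages
new_messages = message_count(payload, 2 * KIB * KIB)   # 50 messages
```

Under these assumptions the same payload needs roughly 24 times fewer messages with the 2 MB default.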
Key takeaway: ArmoniK is no longer an emerging project but a platform deployed in production at several financial players, with a complete multi-language ecosystem and a deliberate release discipline, the concrete foundation on which the product vision can build.
1.3 The roadmap
The roadmap is organized around two mutually reinforcing pillars, easier to use and more performant, converging toward a dual delivery model: a managed PaaS (Aneo operates the control plane, the client keeps data and compute in its own tenant) and a custom-deployable solution (on-prem, bare-metal, supercomputer). Five thematic groups structure the work.
- Core engine - persistence & control plane. Foundation: migration from MongoDB to a relational SQL database (strong consistency, better scalability, easier access to managed services such as RDS/Aurora or on-prem PostgreSQL, a key enabler for the managed PaaS). Above it: refactoring of the control plane, which becomes the scheduling service (stateless, with idempotent RPC APIs); the current scheduling agent then no longer has direct access to the database or the queue (horizontal scalability). Then optimization (batching of database operations, task prefetching, fewer database round-trips). Finally, data-plane isolation via a dedicated Data Agent separating control traffic from data traffic.
- Scaling & multi-region resilience. Kubernetes-native autoscaling operator (Prometheus metrics + Karpenter) across multiple clusters/regions, automated partition provisioning, a workload-routing layer (region-aware placement for latency and compliance, on-prem to cloud overflow). Disaster recovery built on three pillars: globally distributed SQL for metadata, queues rebuildable from the database, cross-region object-storage replication. Region bursting: overflow to another region with isolated data planes (no data leakage).
- Deployment portability. A spectrum from managed to custom: Cloud Marketplace (one-click deployment, billing integration) → Cloud IaaS / K8s (Helm + Terraform, reference deployments) → Bare Metal (process-based lifecycle, without Kubernetes, for regulated or legacy environments) → Supercomputer (Slurm integration, for the scientific community: nuclear fusion, seismic imaging, automotive crash).
- Security, quality & observability. Secure data access (token-based authorization, tenant isolation), user management (RBAC/ABAC, SSO and identity federation), non-regression testing (automated test suites, performance benchmarks, quality gates, particularly important for the managed PaaS), functional observability (critical-path analysis, comparison of task metrics across runs), and unified infrastructure observability (centralized dashboard: hardware, network, application).
- Scheduling, data management & ecosystem. Advanced scheduling (fair share, metric-driven, task/data affinity). New categories of data dependencies beyond the mandatory ones: optional (the task can proceed without it), lazy (fetched on demand rather than prefetched), and lazy-optional. Task checkpointing (save/restore state for long-running tasks). OpenGRIS integration (ParFun SDK, ParGraph SDK, Scaler metascheduler) as a gateway to the broader HPC ecosystem.
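The planned dependency categories can be pictured as two independent flags: whether a dependency blocks the task, and whether it is fetched before the task starts. The sketch below is one possible reading, not a specification: the names are hypothetical, and it assumes that a lazy (non-optional) dependency must still exist before the task starts.

```python
from enum import Enum

class DependencyKind(Enum):
    MANDATORY = "mandatory"
    OPTIONAL = "optional"
    LAZY = "lazy"
    LAZY_OPTIONAL = "lazy-optional"

# Assumption: only the two non-optional kinds block the start of a task.
BLOCKING = {DependencyKind.MANDATORY, DependencyKind.LAZY}
# Assumption: only the two non-lazy kinds are downloaded before the task runs.
PREFETCHED = {DependencyKind.MANDATORY, DependencyKind.OPTIONAL}

def can_start(deps: dict[str, DependencyKind], available: set[str]) -> bool:
    """A task is ready once every blocking dependency is available."""
    return all(name in available for name, kind in deps.items() if kind in BLOCKING)

def to_prefetch(deps: dict[str, DependencyKind], available: set[str]) -> list[str]:
    """Dependencies to download up front: non-lazy ones that actually exist."""
    return sorted(n for n, k in deps.items() if k in PREFETCHED and n in available)

deps = {
    "market_data": DependencyKind.MANDATORY,
    "calibration_hint": DependencyKind.OPTIONAL,
    "full_history": DependencyKind.LAZY,
    "debug_dump": DependencyKind.LAZY_OPTIONAL,
}
available = {"market_data", "full_history"}
ready = can_start(deps, available)        # True: both blocking items exist
prefetch = to_prefetch(deps, available)   # ["market_data"]
```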
Roadmap key message. It is a vision and a priority list, not a dated plan. This prioritized vision will make it possible to refine a schedule with the support of our clients and partners.
1.4 Answers to the questions submitted at registration
Aneo closed the morning by answering the questions submitted by participants at registration. Five topics were addressed.
How to migrate to the latest release? Breaking change: strict enforcement of runAsNonRoot and a fixed runAsUser in the pod securityContext. Worker images running as root (UID 0) are rejected by the admission controller, so the securityContext must be aligned (e.g. UID/GID 5000). Partition configuration now goes through indexed environment variables (InitServices__Partitioning__Partitions__0, __1, …), each holding a JSON string; to update a partition, edit the variable and redeploy with InitServices__InitDatabase=true.
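The indexed-variable convention can be generated from a list rather than written by hand. The following is a minimal sketch: the variable names come from the presentation, but the JSON field names (`PartitionId`, `Priority`) are hypothetical placeholders, since the partition schema was not detailed during the session.

```python
import json

def partition_env(partitions: list[dict]) -> dict[str, str]:
    """Render one indexed environment variable per partition, each a JSON string."""
    env = {
        f"InitServices__Partitioning__Partitions__{index}": json.dumps(partition, sort_keys=True)
        for index, partition in enumerate(partitions)
    }
    # Ask the orchestrator to apply the declared configuration at startup.
    env["InitServices__InitDatabase"] = "true"
    return env

# Hypothetical field names, for illustration only.
partitions = [
    {"PartitionId": "default", "Priority": 1},
    {"PartitionId": "gpu", "Priority": 2},
]
env = partition_env(partitions)
```

The resulting dictionary can then be fed to whatever templating the deployment uses (Helm values, Terraform variables, or a Compose file).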
How to use ArmoniK for scientific computing? Typical patterns are parameter sweeps, Monte Carlo, iterative solvers (sub-tasks expressing iterations recursively), and finite-element / finite-difference methods (partitioning the domain into sub-tasks). In practice: use the Python or C++ SDK for worker logic, Python to express complex graphs, exploit data locality through shared dependencies, package the solver in a Docker image, enable retries (transient failures happen at scale), and monitor via the bundled Grafana dashboards.
How to deploy on bare metal? Prerequisites are Docker and Docker Compose, and optionally a shared storage backend (Ceph, NFS, on-prem MinIO), a queue (ActiveMQ or RabbitMQ), and a database (MongoDB). Deployment consists of configuring the docker-compose.yml (worker, storage, queue, database), starting the services on each node, then validating with the provided smoke-test job. Important caveat: the Docker Compose deployment is not yet production-ready.
How does ArmoniK compare to similar tools? A comparison table positions ArmoniK against Airflow, Ray/Dask, SLURM, DataSynapse and Symphony on the axes: fault-tolerant tasks, dynamic graphs, multi-cluster/hybrid, data-aware scheduling, Kubernetes-native, financial-grade security, open source, and computation/communication overlap. ArmoniK's niche: production-grade orchestration for dynamic, fault-sensitive, high-volume workloads, in regulated or multi-site environments, where correctness and auditability matter as much as throughput.
What technical skills are required? On the deployment side: Kubernetes (Helm, cluster administration, RBAC, namespaces), Infrastructure as Code (Terraform), networking (DNS, TLS, ingress, firewall, VPN/peering), storage (S3-compatible object storage, persistent volumes), and identity/access (OIDC/LDAP, certificates). On the operations side: observability (Prometheus, Grafana, log aggregation), incident handling (reading task states, diagnosing worker crashes, draining partitions), upgrade management (rolling updates, schema migrations, rollback), and capacity planning (autoscaler tuning, quotas, cost attribution). On the development side, finally, writing workers and defining graphs requires mastery of the ArmoniK SDKs and of best practices for distributed application design.
These answers implicitly sketch ArmoniK's adoption profile: a powerful but infrastructure-demanding platform, whose use in POCs can be simplified (Python, bare-metal), but whose production operation requires solid Kubernetes and observability skills.
2. Afternoon session - Client testimonials
2.1 Natixis - "DataSynapse's journey to ArmoniK" (testimonial by William Leduc, Manager Leader, Calculation Core Services)
Feedback on the migration from DataSynapse, with the Forex and Commodities assets as the first use case. As of today, Natixis is not yet in production: the testimonial covers a well-advanced project whose go-live is still ahead (progressive go-live starting July 2026).
Project successes. The benefits of an innovative, cloud-ready solution were highlighted: easy adaptation to demand variations and resource scaling, containerized infrastructure (better availability), and optimization of operational costs through a pay-as-you-go model. The partnership with Aneo played a key role: continuous support across the whole project lifecycle, a Tech Lead ensuring the handover from build to run, training, and flexibility. On the Natixis side, notable factors include the availability of the Container Infrastructure team, useful design skills, an established partnership between IT Infrastructure and GM (Global Markets), and growing stakeholder involvement as the project became real.
A complex integration, carried out with close support. Deploying ArmoniK in a production environment, multi-cluster and in a regulated, secured zone, is demanding by nature. Several technical topics required fine tuning: compilation, Kubernetes architecture and policies, secret management, library conflicts. On each of these, Aneo supported the Natixis teams: diagnosing and fixing worker crashes via the SDK, setting up a dedicated network zone for the ArmoniK grids, automating the CI/CD, and deploying the monitoring stack. Tuning focused on long-running jobs and Kubernetes pod management, and a first version of the ArmoniK Load Balancer was put in place for multi-cluster. Joint benchmarking made it possible to identify the performance areas to work on, leading to an impact analysis and a shared migration strategy; on the Aneo side, the effort focused on fixes, optimizations, interoperability, and SDK packaging.
IT architecture. The Forex and Commodities clients call two Kube/ArmoniK sites through an F5 (network load balancer) and the ArmoniK Load Balancer. Each site comprises a control plane (3 K8s masters) and worker nodes, the ArmoniK Core block (Admin GUI, Node Exporter, Control Plane, Pod Deletion Cost Updater, Metrics Exporter, Ingress), and external images (Prometheus, MongoDB, Grafana, ActiveMQ, Redis, Scality S3, Fluent Bit). A multi-cluster ArmoniK topology in active/active was set up: the ArmoniK Load Balancer is replicated on OpenShift, each replica driving the two downstream ArmoniK clusters, via one F5 per cluster. This component, referred to as "Meta Control Plane" in the presentation, is a load balancer that understands ArmoniK requests.
In conclusion, the Natixis journey shows that deploying ArmoniK in a critical, regulated environment is demanding, but succeeds when it relies on close support. The value of the partnership with Aneo, from build to run, and the gradual stabilization of the platform were decisive. The topics worked on together (multi-cluster, monitoring, performance) also align with several roadmap themes.
2.2 Crédit Agricole CIB (CACIB) - "ArmoniK Feedback" (testimonial by Idhir Madjour, ArmoniK Chapter Leader / Tech Lead Pricing Service GMD)
Feedback on the migration from IBM Symphony: an ArmoniK deployment in production on demanding FSI workloads. The presentation, structured in four parts (migration strategy, scalable data plane, improvements, conclusion), illustrates a real ramp-up whose pain points were identified and then resolved.
Migration strategy - "lift and shift then refactoring". The business task graph is complex and is modeled differently in Symphony and in ArmoniK. Issue encountered: the workload behavior and the SLA were no longer the same after the switch, because the Symphony and ArmoniK task trees differ structurally. A refactoring strategy for ArmoniK was therefore applied, with results including reduced memory consumption and a smaller package size for the services.
Scalable data plane. Pushing ArmoniK to large scale, CACIB revealed the scalability limits of two legacy components: ActiveMQ and MongoDB did not keep up with the growing number of workers. Concretely, ActiveMQ reached a threshold beyond which messages were neither produced nor consumed, and MongoDB's CPU saturated durably, severely degrading the system. These pain points were resolved: switching to SQS (scalable and managed by AWS), a new task-distribution algorithm and moving MongoDB to a Nitro-generation EC2 instance (AWS's near-bare-metal virtualization, which frees up more CPU and I/O throughput for the database), then migration toward a sharded MongoDB approach for better resource management (tests ongoing). The targeted scalability is now achieved on the data plane.
Improvements and contributions. Because the ArmoniK model relies on coordinated distribution, pressure on MongoDB grows with the number of workers: the path considered is to rework the flow from the task graph (PricingService). On their side, CACIB relies on a metascheduler built into their compute library, which adjusts task granularity and thereby limits this pressure. For interactive pricing, CACIB contributed an SQS fair queue and studied the benefit of hot-start pods. Finally, faced with the rise of attacks exploiting AI and supply-chain techniques, the need for better management of the components used (Docker images, NuGet and npm packages) was highlighted. Several of these efforts directly benefit the product and the community.
Key takeaway: the CACIB feedback shows that ArmoniK holds up in production on critical, high-volume workloads: the scalability limits hit on the legacy components (ActiveMQ, MongoDB) were identified and then resolved (SQS, sharding, new distribution algorithm). Along the way, the work done by CACIB, including an SQS fair-queue contribution, strengthened the product for all high-volume deployments. The remaining topics, such as supply-chain security, align with roadmap themes.
3. Afternoon workshops - Monitoring & metrics
3.1 Workshop 1 - Optimization & infrastructure (facilitated by Wilfried, Aneo)
This workshop brought together Aneo and users around a challenge: making computations on ArmoniK faster and more efficient. Participants explored how to adapt task size, track memory consumption, and better observe application behavior.
Adapting task size from history (Intesa feedback). Giorgio (Intesa) shared a method used in their compute library with DataSynapse. For simulations (Monte Carlo type) to be efficient, tasks must be neither too short (under half a second) nor too long (over 30 seconds). To find the right balance, their application measures the time of each computation and stores it in a cache with a unique key tied to the contract (warrant or trade number). Before launching a new computation, the application consults this cache to know the past duration and automatically adjusts the size of the next tasks (e.g. the number of Monte Carlo paths). ArmoniK will need to be able to reproduce this intelligent behavior.
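The history-based sizing described by Intesa can be sketched as a small cache keyed by contract. This is an illustrative model, not their implementation: the class name, the 5-second target (chosen inside the 0.5 to 30 second window), and the default path count are assumptions.

```python
class DurationCache:
    """Remembers the measured cost per Monte Carlo path for each contract key."""

    def __init__(self) -> None:
        self._seconds_per_path: dict[str, float] = {}

    def record(self, key: str, paths: int, seconds: float) -> None:
        """Store the cost observed for the latest computation of this contract."""
        self._seconds_per_path[key] = seconds / paths

    def paths_for(self, key: str, target_seconds: float = 5.0,
                  default_paths: int = 10_000) -> int:
        """Pick a path count so that the next task lasts about target_seconds."""
        per_path = self._seconds_per_path.get(key)
        if per_path is None or per_path <= 0:
            return default_paths  # no usable history: fall back to a default
        return max(1, round(target_seconds / per_path))

cache = DurationCache()
cache.record("trade-42", paths=10_000, seconds=0.2)  # too short (under 0.5 s)
next_paths = cache.paths_for("trade-42")              # resized to about 5 s
```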
Grouping tasks and using performance alerts (CACIB feedback). Yanik (CACIB) explained their technique for grouping tasks (packing). To avoid sending too many small computations that would overload the database, their application probes at startup to estimate processing durations and bundle tasks. CACIB would like ArmoniK to return performance data (such as the total computation time for the same family of trades). This information would let the application learn from the past to better split subsequent computations. This is all the more useful because complex financial products (exotics) change behavior and become simpler to compute as they approach their maturity date.
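Packing can be sketched as a greedy bundling of tasks by estimated duration. This is a simplified stand-in for the technique described, not CACIB's algorithm: the function name, the estimates, and the 2-second target are hypothetical.

```python
def pack_tasks(estimates: dict[str, float], target_seconds: float) -> list[list[str]]:
    """Greedily bundle tasks, in submission order, into batches of about target_seconds."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_seconds = 0.0
    for task_id, seconds in estimates.items():
        # Close the current batch if adding this task would exceed the target.
        if current and current_seconds + seconds > target_seconds:
            batches.append(current)
            current, current_seconds = [], 0.0
        current.append(task_id)
        current_seconds += seconds
    if current:
        batches.append(current)
    return batches

# Estimated durations in seconds, e.g. from a startup probe or past runs.
estimates = {"t1": 0.5, "t2": 0.25, "t3": 1.5, "t4": 0.25, "t5": 2.5}
batches = pack_tasks(estimates, target_seconds=2.0)
```

A task longer than the target, such as `t5`, ends up alone in its batch. Fewer, larger batches mean fewer task records for the orchestrator's database to track.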
The critical memory problem (Intesa / Natixis). Giorgio (Intesa) recalled that, without task grouping, some complex computations require up to 4 GB of memory. The real cost is not the hardware, but the time the system loses allocating that memory. William (Natixis) pointed out that this problem is hard to track: when a task exceeds the allowed memory, Kubernetes kills the pod instantly, so the measurement tools do not have time to see the spike and the error that comes back is incomprehensible. An ArmoniK feature able to raise an alert as soon as memory reaches 80% of the allowed limit would be very useful.
Monitoring drift and hardware impact (Natixis / CACIB). William (Natixis) explained that with each new version of their compute library, they measure CPU and memory consumption per portfolio. This checks that code changes do not slow the application down. Finally, hardware choice is crucial. On the CACIB network, which mixes four processor generations, the same computation can take twice as long depending on the machine. Using optimized compilers (such as Intel ICC) improves speed, but the orchestrator must know which machine the computation ran on so as not to skew the statistics.
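One way to keep hardware differences from skewing the statistics is to aggregate durations per (portfolio, processor generation) pair, so that releases are compared like for like. A minimal sketch follows, with hypothetical labels and sample values.

```python
from collections import defaultdict
from statistics import mean

def mean_by_hardware(samples: list[tuple[str, str, float]]) -> dict[tuple[str, str], float]:
    """samples: (portfolio, cpu_generation, seconds). Returns the mean per pair."""
    grouped: dict[tuple[str, str], list[float]] = defaultdict(list)
    for portfolio, generation, seconds in samples:
        grouped[(portfolio, generation)].append(seconds)
    return {key: mean(values) for key, values in grouped.items()}

# Hypothetical measurements: the same portfolio on two processor generations.
samples = [
    ("fx-book", "gen1", 20.0),
    ("fx-book", "gen1", 24.0),
    ("fx-book", "gen4", 10.0),
    ("fx-book", "gen4", 12.0),
]
stats = mean_by_hardware(samples)
```

Mixing the four samples would give a single mean of 16.5 seconds that describes neither machine type, which is why the orchestrator needs to record where each computation ran.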
Product framing - a new cross-session metric-sharing feature. This workshop helped define a future ArmoniK feature. The goal is to store and retrieve measurements from one compute session to another:
- Input (prefetch): when launching a computation, the user will be able to ask ArmoniK to fetch the last known value of a business metric. ArmoniK will pass it directly to the compute program at startup.
- Output (persistence): at the end of its execution, the task will be able to return new measurements. ArmoniK will keep them durably, even after the session is closed.
- Usefulness: this tool will enable analysis of computation drift over the long term. It can also be used to automatically optimize the size and construction of future tasks on the fly.
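The three points above can be sketched with an in-memory stand-in for the future feature. The interface is hypothetical (the feature was only being framed during the workshop) and a real implementation would need durable storage.

```python
from typing import Optional

class MetricStore:
    """In-memory stand-in for business metrics that outlive a compute session."""

    def __init__(self) -> None:
        self._history: dict[str, list[tuple[str, float]]] = {}

    def persist(self, name: str, session_id: str, value: float) -> None:
        """Output side: a task returns a new measurement at the end of its run."""
        self._history.setdefault(name, []).append((session_id, value))

    def last_known(self, name: str, default: Optional[float] = None) -> Optional[float]:
        """Input side: fetch the last known value when launching a computation."""
        history = self._history.get(name)
        return history[-1][1] if history else default

    def drift(self, name: str) -> Optional[float]:
        """Relative change between the first and the last recorded values."""
        history = self._history.get(name, [])
        if len(history) < 2 or history[0][1] == 0:
            return None
        return (history[-1][1] - history[0][1]) / history[0][1]

store = MetricStore()
store.persist("seconds_per_trade", session_id="session-1", value=10.0)
store.persist("seconds_per_trade", session_id="session-2", value=12.0)
latest = store.last_known("seconds_per_trade")   # 12.0, available to session 3
change = store.drift("seconds_per_trade")        # about +20% since session 1
```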
3.2 Workshop 2 - ArmoniK metrics (facilitated by Florian, Aneo)
This was a needs-gathering workshop. After a recap of the metrics already exposed, the floor was given to clients on six themes (metrics, observability, capacity planning, scaling prediction, scheduler behavior, worker-level instrumentation). Underlying tension, recurring throughout the day: where to draw the line between the primitives provided by ArmoniK and the operational intelligence built by the client.
Existing metrics and gaps. ArmoniK already exposes tasks by state (global and per partition), agent timings, task duration, Kubernetes infrastructure metrics (CPU, RAM, pods), and, optionally, MongoDB. Two weaknesses stand out: a counter of completed tasks per agent exists but is undocumented, and the I/O data size per task is not materialized as a metric even though it is in the database.
Client requests. Network/transfer observability (Intesa) is the most recurring need: an I/O-bound computation is often mistaken for a slow one, hence a request for bytes and throughput per agent and for distinguishing compute from transfer. Natixis wants worker-level instrumentation (internal timings, per stage, per model, regression across releases): the path considered would be standardized hooks or endpoints letting the worker publish its metrics through the Prometheus pipeline, without ArmoniK owning these application metrics. CACIB wants a cumulative per-pod task counter, accessible from the task code, to detect memory leaks. Aneo will not provide it as standard: exposing this counter to the task would make its behavior depend on the pod's history (the number of tasks already processed), which would break the expected determinism and idempotency. Given this architecture, it is preferable that the user computes this data itself on the worker side, which is entirely feasible: the need therefore remains addressable, but outside the scope provided by ArmoniK. For per-business-line metrics (Natixis), the recommendation is to use separate partitions rather than Prometheus labels (cardinality). Pods per host, finally, are already computable in PromQL.
Scheduler and scaling. The major point is partition starvation: with no quota or guaranteed minimum, one partition can starve another indefinitely (current workaround: a hard-coded HPA formula, which remains fragile). Proposed solution and likely roadmap candidate: per-partition min/max pod counts, analogous to Kubernetes requests/limits. Other topics: time-of-day partition weighting (to be exposed as parameter metrics for PromQL-based control, preferably avoiding yet another DSL); downscaling already solved via a pod-deletion-cost annotation that preserves the most valuable tasks, whose computed cost could be exposed as a metric; and queue-velocity metrics (inflow, consumption, completion rates).
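The per-partition min/max proposal can be illustrated with a small allocation routine. This is a sketch of the idea only, not a design agreed in the workshop: the function name, the round-robin distribution of the remaining capacity, and the partition names are assumptions.

```python
def allocate_pods(demand: dict[str, int], bounds: dict[str, tuple[int, int]],
                  capacity: int) -> dict[str, int]:
    """demand: partition -> wanted pods; bounds: partition -> (min, max) pods."""
    # 1. Every partition first receives its guaranteed minimum.
    alloc = {partition: bounds[partition][0] for partition in demand}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("capacity is below the sum of guaranteed minimums")
    # 2. Hand out the rest one pod at a time, round-robin, up to demand and max.
    progress = True
    while remaining > 0 and progress:
        progress = False
        for partition in demand:
            ceiling = min(demand[partition], bounds[partition][1])
            if remaining > 0 and alloc[partition] < ceiling:
                alloc[partition] += 1
                remaining -= 1
                progress = True
    return alloc

demand = {"batch": 100, "interactive": 5}
bounds = {"batch": (0, 50), "interactive": (2, 10)}
alloc = allocate_pods(demand, bounds, capacity=20)
```

With 20 pods available, the interactive partition gets all 5 pods it asks for and the batch partition takes the other 15. Even if capacity dropped to 2, the interactive partition would keep its guaranteed minimum, which is the property missing today.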
Predictive metrics (Natixis). For overnight batches of 1,000 to 10,000 tasks, the client wants to scale up preemptively. ArmoniK alone has no signal to predict: it needs task labels from the client. Per-task duration prediction is tractable (moving average per label); session-level submission dynamics are much harder. Path considered: an optional predictor service driven by labels, exposing a Prometheus metric, with actuation staying on the client side. Because these metrics are highly environment-specific, the direction is to expose primitives and let the client compose the formula to suit their needs.
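The tractable part, per-task duration prediction by moving average per label, can be sketched as follows. The class name and the window size are hypothetical, and the sketch deliberately stops at producing a number: exposing it as a Prometheus metric and acting on it would remain separate steps.

```python
from collections import defaultdict, deque
from typing import Optional

class DurationPredictor:
    """Moving average of task durations, kept separately for each client label."""

    def __init__(self, window: int = 20) -> None:
        # Each label keeps only its most recent `window` observations.
        self._samples: dict[str, deque] = defaultdict(lambda: deque(maxlen=window))

    def observe(self, label: str, seconds: float) -> None:
        self._samples[label].append(seconds)

    def predict(self, label: str) -> Optional[float]:
        """Predicted duration for the label, or None when there is no history."""
        samples = self._samples.get(label)
        if not samples:
            return None
        return sum(samples) / len(samples)

predictor = DurationPredictor(window=3)
for seconds in (1.0, 2.0, 3.0, 4.0):
    predictor.observe("overnight-batch", seconds)
estimate = predictor.predict("overnight-batch")   # mean of the last 3 samples
```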
Multi-cluster (Natixis). The topic was the trade-off between data locality and instance price across providers and regions. Aneo clarification: a session is already bound to a single cluster at creation, so data locality is ensured "for free". The real need is a central Prometheus aggregating a subset of cross-cluster metrics, plus cost signals to weight placement, it being understood that spot pricing makes any cost metric approximate.
Paths and takeaway. Several paths were raised, without any commitment: documenting the metric/dashboard mapping, adding per-agent I/O metrics, exposing pod-deletion-cost and partition definitions (weights, min/max) as metrics, prototyping a predictor service, designing a cross-cluster Prometheus aggregation, or revisiting the per-partition min/max design. Strategic takeaway: clients are moving from basic observability to predictive operational intelligence, and the emerging line is to expose rich primitives rather than embed the client's business logic. Four themes dominate, by frequency: predictive scaling, transfer/network observability, partition fairness, worker-level instrumentation.
4. Key messages of the day
- A clear, openly stated thesis: three worlds (FSI, HPC, cloud) converged on the task graph; ArmoniK pushes further with a bipartite and dynamic graph where data is a first-class citizen, the foundation of data-aware scheduling and explicit data lifecycle management.
- Maturity proven in production: several CACIB teams are in production, Natixis is preparing its go-live (progressive, starting in July 2026), and POCs exist beyond finance (EDF, automotive, research). The new production release v2.23.0 marks the shift to a release discipline (rolling vs production) expected by production clients.
- A roadmap steered by a single north star: everything converges toward the dual delivery model (managed PaaS + custom-deployable), driven by the SQL migration, control-plane decoupling, multi-region resilience, and deployment portability. The roadmap is a prioritized vision, not a dated plan; it will make it possible to refine a schedule with the support of our clients and partners.
- Client feedback bears on themes shared with the roadmap: CACIB (ActiveMQ and MongoDB limits, SQS/sharding migration) and Natixis (multi-cluster via the ArmoniK Load Balancer, worker stability, tuning) overlap with the core-engine and scaling axes. These are shared priorities, with a schedule to be framed with the support of our clients and partners.
- The hot topic of the year is operational observability: the two workshops converge on the same demand, moving from raw metrics to predictive intelligence, and on the same direction: exposing rich primitives rather than embedding the client's business logic. Four priorities emerge: predictive scaling, network/transfer observability, partition fairness, and worker instrumentation.
- Concrete action paths emerged from the metrics workshop, notably closing the metrics documentation debt, adding per-agent I/O metrics, or revisiting the per-partition min/max design to address starvation (a real friction point reported in production).