In the ever-evolving landscape of high-throughput computing (HTC), managing computational resources efficiently is a constant challenge. Unexpected spikes in demand can overwhelm local grids, leading to delays and inefficiencies. One effective strategy to address this issue is cloud bursting, which allows organizations to temporarily expand their computing capacity by integrating cloud resources.
This article explores how IBM Spectrum Symphony facilitates cloud bursting and shares our experience developing a plugin for Google Cloud Platform (GCP) that leverages Managed Instance Groups (MIGs).
Cloud bursting
Cloud bursting is a cloud strategy in which an on-premises application bursts into the public cloud when demand for computing capacity spikes. In the context of HTC, where tasks are often loosely coupled, less dependent on data locality, and executable independently, cloud bursting provides a scalable way to meet fluctuating workloads without over-provisioning local resources.
IBM Spectrum Symphony
IBM Spectrum Symphony is a powerful grid management tool designed for HTC environments. It efficiently distributes workloads across available resources, optimizing performance and throughput. Symphony is the HTC counterpart to IBM Spectrum LSF, which is tailored for HPC workloads.
Symphony abstracts computing resources into "slots," which represent a combination of CPU cores and memory. An example of a slot would be (1 core, 512MB). The Service-Oriented Application Manager (SOAM) can then allocate these slots to different applications based on demand.
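As a rough illustration of the abstraction (the slot size and machine shape below are arbitrary examples, not values from a real Symphony configuration), the number of slots a machine can back is bounded by both its cores and its memory:

```python
# Illustrative only: the slot size and machine shape are arbitrary examples.
SLOT_CORES = 1
SLOT_MEMORY_MB = 512

def slots_per_machine(cores: int, memory_mb: int) -> int:
    """A machine backs as many slots as both its cores and its memory allow."""
    return min(cores // SLOT_CORES, memory_mb // SLOT_MEMORY_MB)

# A hypothetical 16-core, 64 GB machine is CPU-bound at 16 slots of (1 core, 512 MB).
print(slots_per_machine(cores=16, memory_mb=64 * 1024))  # -> 16
```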
SOAM is one of the many daemons that make up Symphony; another is the Host Factory. The Host Factory is the component of Symphony that facilitates cloud bursting by provisioning and deprovisioning cloud instances as needed. It acts as an intermediary between Symphony and cloud providers, managing the lifecycle of cloud resources.
Host Factory
Overview
The Host Factory works by polling the Symphony cluster to assess resource utilization and the demand for additional slots; this polling is done by "requestors". When the cluster is in a resource deficit, the Host Factory provisions resources from cloud providers to meet the demand, and it releases unnecessary cloud resources when demand decreases. The actual provisioning and deprovisioning are delegated to Host Factory plugins for specific cloud providers. The only thing the Host Factory itself is aware of is the machine templates, which it uses to map the slots reported by the requestor onto concrete machines.
The way the Host Factory functions significantly limits how we can interact with it. For one, the polling mechanism means that communication is unidirectional and periodic, which can lead to delays or mismatches in resource allocation. Moreover, the Host Factory cannot be immediately informed of changes initiated outside its control, such as instances created or destroyed by other systems. We'll discuss how these two limitations impacted our design decisions and some ways of working around them.
Templates
To translate Symphony's abstract "slots" into concrete cloud resources, the Host Factory uses templates. These templates define the specifications of cloud instances, including:
- Instance Types: CPU, memory, and storage configurations.
- Pricing Models: On-demand, Spot, or reserved instances, along with per-machine pricing.
- Priority Levels: Determine which templates to use under different conditions.

Templates are typically defined in JSON files and allow the Host Factory to select cloud instances that match the required slots, as sketched below.
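The field names, machine types, and numbers in this sketch are hypothetical and do not reflect the actual Host Factory template schema; it only illustrates the kind of information a template carries and how a priority-based selection could turn a slot request into a machine count.

```python
import math

# Hypothetical templates, mirroring the kind of JSON a plugin would read.
# Field names and values are illustrative, not the real template schema.
templates = [
    {"templateId": "gcp-spot-16core", "vCPUs": 16, "memoryMB": 65536,
     "pricing": "spot", "priority": 10},
    {"templateId": "gcp-ondemand-16core", "vCPUs": 16, "memoryMB": 65536,
     "pricing": "on-demand", "priority": 5},
]

def plan_machines(requested_slots: int, slot_cores: int = 1) -> dict:
    """Pick the highest-priority template and compute the machine count
    needed to cover the requested slots (ceiling division on cores)."""
    template = max(templates, key=lambda t: t["priority"])
    count = math.ceil(requested_slots * slot_cores / template["vCPUs"])
    return {"templateId": template["templateId"], "machineCount": count}

print(plan_machines(1000))  # -> {'templateId': 'gcp-spot-16core', 'machineCount': 63}
```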
Strategy for developing the plugin
Developing this cloud bursting plugin presented several potential strategies. Traditional Symphony plugins for other cloud providers often employed methods analogous to bulk virtual machine creation. However, for Google Cloud Platform (GCP), our objective was to create a more deeply integrated and efficient solution, leveraging GCP-native services to achieve scalability potentially reaching millions of cores.
One avenue initially considered was utilizing Google Kubernetes Engine (GKE). This approach envisioned deploying Symphony, along with the dynamically provisioned compute hosts, within a GKE cluster. A potential benefit was leveraging Kubernetes-managed autoscaling. However, this presented a significant trade-off: Kubernetes autoscaling typically relies on resource consumption metrics (like CPU or memory usage), whereas Symphony's Host Factory ideally scales based on pending task demand. This mismatch in scaling triggers could lead to suboptimal resource allocation. While GKE integration remains a possibility for certain use cases, this fundamental difference prompted further investigation into alternative GCP services.
Leveraging Google Cloud Platform's Managed Instance Groups
Managed Instance Groups (MIGs) in GCP are a powerful abstraction for managing collections of virtual machine instances. They offer features such as auto-scaling, auto-healing, and simplified instance management.
Benefits of Using MIGs:
- Auto-Healing: Replaces unhealthy instances without manual intervention.
- Consistency: Ensures all instances are uniform, simplifying configuration management.
- Faster GPU Instance Provisioning
We chose MIGs to leverage these benefits, aiming to create a more robust and scalable cloud bursting solution. Additionally, the recent introduction of flexible MIGs ([see our article on Managed Instance Groups]) allows preemption-rate-aware provisioning for Spot VMs while also greatly reducing the complexity of managing multiple MIGs for different instance types.
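For orientation, here is a minimal sketch of how a MIG is inspected programmatically with the google-cloud-compute Python client; the project, zone, and MIG name are placeholders, not values from our deployment.

```python
from google.cloud import compute_v1

# Placeholders: substitute your own project, zone, and MIG name.
PROJECT, ZONE, MIG_NAME = "my-project", "europe-west1-b", "symphony-burst-mig"

client = compute_v1.InstanceGroupManagersClient()

# A MIG is driven through a single knob: its target size. GCP creates,
# heals, and deletes instances to converge on that number.
mig = client.get(project=PROJECT, zone=ZONE, instance_group_manager=MIG_NAME)
print(mig.name, mig.target_size, mig.instance_template)
```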
What went wrong
While we assembled a proof of concept that worked for small tests, bursting to about 100-150 machines in our GitHub workflows, we ran into significant challenges. The core issues were twofold:
- Conflicting Management: MIGs can create or destroy instances independently to maintain the desired group size, which may conflict with the Host Factory's actions.
- Resource Discrepancies: Instances created by MIGs outside the Host Factory's control are not recognized, leading to misalignment in resource tracking. Because the Host Factory never requested these resources, it also never asks for them to be destroyed.
A commonly recurring scenario is as follows: we use Spot VMs for our cloud bursting machines, and when one of these machines is preempted and destroyed, our MIG's auto-healing feature is triggered and attempts to recreate the instance. The problem is that while this new instance is "valid" and "usable", the Host Factory doesn't claim ownership over it: it isn't aware that the instance is there to replace one of the previous ones. The only thing we can tell the Host Factory through its limited polling interface is that the old instance has been destroyed. We're left with a zombie instance, and the only viable action within the Host Factory's control is to request its deletion. This is extremely inefficient.
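To make the mismatch concrete, the sketch below diffs the instances the MIG actually manages against the ones the Host Factory believes it provisioned. It uses the google-cloud-compute Python client; the project, zone, MIG name, and the Host Factory bookkeeping set are placeholders invented for illustration.

```python
from google.cloud import compute_v1

# Placeholders for illustration only.
PROJECT, ZONE, MIG_NAME = "my-project", "europe-west1-b", "symphony-burst-mig"

client = compute_v1.InstanceGroupManagersClient()

# Instances the MIG is actually managing right now, including any it
# recreated itself through auto-healing.
managed = {
    mi.instance.rsplit("/", 1)[-1]
    for mi in client.list_managed_instances(
        project=PROJECT, zone=ZONE, instance_group_manager=MIG_NAME
    )
}

# Hypothetical bookkeeping: instance names the Host Factory believes it provisioned.
known_to_host_factory = {"burst-host-0001", "burst-host-0002"}

# Instances that exist and serve the cluster but were never requested by
# the Host Factory -- the "zombies" described above.
zombies = managed - known_to_host_factory
print(zombies)
```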
An Alternative Approach: Skipping the Host Factory
Given the limitations, we explored an alternative by bypassing the Host Factory and using custom requestors.
Requestors are components that monitor cluster demand and request resources accordingly. By customizing the requestor, we gain greater control over resource management: we can create a requestor plugin that monitors our Symphony cluster and then, instead of routing this information to the Host Factory (which would in turn request individual machines), use it to directly set the target size of the MIGs based on the resource requirements.
We can still use pricing information, priority, and other criteria to determine which MIG to scale. Or, in the case of flexible MIGs, we can delegate a significant amount of that complexity to GCP and rely on its preemption-aware provisioning.
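A minimal sketch of this approach, assuming a single zonal MIG: the pending-slot figure and the slots-per-instance constant stand in for the custom requestor's real logic, and the resize call uses the google-cloud-compute Python client with placeholder project, zone, and MIG names.

```python
import math
from google.cloud import compute_v1

# Placeholders: substitute your own project, zone, and MIG name.
PROJECT, ZONE, MIG_NAME = "my-project", "europe-west1-b", "symphony-burst-mig"
SLOTS_PER_INSTANCE = 16  # hypothetical: slots one burst instance contributes

client = compute_v1.InstanceGroupManagersClient()

def scale_mig_to_demand(pending_slots: int) -> None:
    """Translate the cluster's slot deficit into a MIG target size and set it directly."""
    target = math.ceil(pending_slots / SLOTS_PER_INSTANCE)
    # A single call: the MIG then creates, heals, or deletes instances to
    # converge on this target, so ownership of the machines stays with the MIG.
    client.resize(
        project=PROJECT,
        zone=ZONE,
        instance_group_manager=MIG_NAME,
        size=target,
    )

# e.g. the custom requestor observed a deficit of 1000 pending slots
scale_mig_to_demand(pending_slots=1000)
```

In practice such a requestor would run on a schedule and reconcile the target size against observed demand, rather than reacting to a single snapshot.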
By directly setting the target size without passing through the Host Factory, we can fully leverage the capabilities of MIGs and create a more seamless and efficient cloud bursting experience. After all, ownership of the cloud resources directly falls into the hands of the MIG. Any instances created by the MIG will automatically connect to our main Symphony cluster thanks to our startup scripts.