Choosing Google Cloud MIGs for Dynamic Scaling: A Technical Deep Dive

Written by Abderrahmane DKOUR, on 15 July 2025

Managed Instance Groups (MIGs) offer a robust and automated way to manage large numbers of virtual machine instances, making them an appealing choice for scaling resources dynamically in high-throughput computing environments. Below, we explore the key features of MIGs and their applicability to cloud bursting with IBM Spectrum Symphony, as well as the challenges and limitations encountered.

Why we went with MIGs

Instances managed by MIGs tend to be more reliable because MIGs can automatically repair instances that fail health checks (e.g., instances that are unresponsive or crash during their workload) while preserving their state, including attached disks, instance names, and metadata. Additionally, when working with Spot VMs, MIGs efficiently track and gracefully shut down preempted instances, and when capacity is restored they automatically recreate them, eliminating the need for manual tracking and management of these ephemeral resources. As a result, these groups of instances are more consistently available, and applications running through Symphony run more smoothly.
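Auto-healing is configured per MIG by attaching a health check. A minimal Terraform sketch of the idea, with hypothetical resource names and a placeholder port (use whatever port your Symphony agents actually expose), assuming an instance template resource named `worker` is defined elsewhere in the configuration:

```hcl
# Illustrative example: a zonal MIG that auto-repairs instances failing a TCP check.
resource "google_compute_health_check" "symphony_worker" {
  name                = "symphony-worker-hc"   # hypothetical name
  check_interval_sec  = 30
  timeout_sec         = 10
  unhealthy_threshold = 3

  tcp_health_check {
    port = 17870   # placeholder: the port your Symphony agent listens on
  }
}

resource "google_compute_instance_group_manager" "workers" {
  name               = "symphony-workers"      # hypothetical name
  zone               = "europe-west1-b"
  base_instance_name = "symphony-worker"

  version {
    instance_template = google_compute_instance_template.worker.id
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.symphony_worker.id
    initial_delay_sec = 300   # let the workload boot before repairs kick in
  }
}
```

The `initial_delay_sec` matters for bursting workloads: set it longer than your slowest expected boot so freshly created instances aren't recycled mid-startup.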

It's much easier to bulk create instances with MIGs, because in the end it boils down to a single resize request: the MIG then works towards its target size by creating the required machines. Without a MIG, you need to repeatedly run bulk instance create commands, verify the created instances, and manually create any missing ones; you also have to manually tag the machines by instance template. MIGs, by contrast, give us a natural way of tracking and grouping those instances. For certain resources, like GPU VMs, MIGs offer the fastest way to bulk deploy them.
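In Terraform terms, scaling the group up or down is a one-field change; each apply issues a single resize request and the MIG converges on the target. A sketch with hypothetical names, again assuming a `worker` instance template defined elsewhere:

```hcl
variable "worker_count" {
  description = "Desired number of Symphony workers; changing it triggers one MIG resize"
  type        = number
  default     = 0
}

resource "google_compute_instance_group_manager" "workers" {
  name               = "symphony-workers"   # hypothetical name
  zone               = "europe-west1-b"
  base_instance_name = "symphony-worker"

  version {
    instance_template = google_compute_instance_template.worker.id
  }

  # Bulk creation and destruction boil down to updating this single attribute.
  target_size = var.worker_count
}
```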

While there is a hard limit on how many instances a single MIG can manage, we can always use multiple MIGs per instance template. Using regional MIGs also spreads instances across multiple zones within a region, increasing resilience against zonal failures: the application remains available even if one zone encounters issues.
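A regional MIG differs only in its resource type and distribution policy. A sketch, with hypothetical names and zones chosen purely for illustration:

```hcl
resource "google_compute_region_instance_group_manager" "workers" {
  name               = "symphony-workers-regional"   # hypothetical name
  region             = "europe-west1"
  base_instance_name = "symphony-worker"

  version {
    instance_template = google_compute_instance_template.worker.id
  }

  # Spread instances across zones so a zonal outage only affects part of the group.
  distribution_policy_zones = [
    "europe-west1-b",
    "europe-west1-c",
    "europe-west1-d",
  ]

  target_size = 0
}
```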

Things we couldn't make use of

Although MIGs support multiple versions for canary and gradually rolled-out updates, they're limited to two versions. This restriction makes it challenging to use a single MIG to deploy multiple instance templates simultaneously, as cloud bursting requires. Moreover, when working with two versions, you have to specify what percentage of machines to deploy for each version, which isn't ideal for our purposes. An easy workaround is to use separate MIGs for different instance templates, which also provides finer control and allows more instances per template within MIG quotas.
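In Terraform, the limitation is visible in the shape of the resource: the second `version` block must carry a size split, and there is no room for a third template in the same MIG. A hedged sketch with hypothetical names:

```hcl
resource "google_compute_instance_group_manager" "workers" {
  name               = "symphony-workers-canary"   # hypothetical name
  zone               = "europe-west1-b"
  base_instance_name = "symphony-worker"
  target_size        = 100

  version {
    instance_template = google_compute_instance_template.worker_v1.id
  }

  # At most one additional version, and it must be sized by a fixed number or
  # percentage -- there is no way to add a third template to the same MIG.
  version {
    name              = "canary"
    instance_template = google_compute_instance_template.worker_v2.id
    target_size {
      percent = 10
    }
  }
}
```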

Moreover, the autoscaling capabilities provided by MIGs aren't of any use to us. The default autoscaling metrics focus on group-level resource usage, which does not align with the task-flow-based scaling requirements of IBM Spectrum Symphony. After all, we'd expect every instance in a MIG to be using most of, if not all of, its CPUs; otherwise it would be better to destroy it! Then there is the question of ownership: the Host Factory wouldn't be aware of any newly created resources when scaling up, for instance.

The load balancing features offered by MIGs are not necessary for our configuration, as the Service-Oriented Architecture Middleware (SOAM) in Symphony already manages task distribution and resource allocation. This redundancy can lead to inefficiencies if not properly managed, but fortunately, most of these features are opt-in, allowing us to bypass them.

How instance flexibility fits into our solution

A significant advancement in Google Cloud Platform's offering is the Instance Flexibility feature for MIGs. This capability allows a single MIG to provision and manage a mix of different machine types based on a user-defined template specification and a list of machine type choices. This directly addresses a key limitation encountered with standard MIGs, where separate groups were needed for each instance template, thereby offering a way to consolidate and simplify the management infrastructure for cloud bursting scenarios involving diverse VM requirements.

Beyond simplified management, Instance Flexibility provides a crucial advantage for workloads commonly deployed in cloud bursting: optimizing the use of Spot VMs. While highly cost-effective, Spot VMs carry the inherent risk of preemption – abrupt termination by GCP when capacity is needed elsewhere. In complex, high-throughput computing environments like those managed by IBM Spectrum Symphony, such preemptions can lead to:

  1. Task Failures: Work currently running on the preempted instance is lost.
  2. Workflow Delays: Dependencies between tasks mean that the failure of one can cause cascading delays throughout the entire computational workflow.
  3. Reduced Throughput: Frequent interruptions diminish the overall computational power delivered over time.

Instance Flexibility directly mitigates this challenge by incorporating preemption-aware provisioning. When configured to use Spot VMs, it doesn't just pick any available Spot instance from your list; it actively attempts to provision instances from the machine types and zones that, at that moment, have the lowest predicted preemption rates. This proactive selection aims to maximize the runtime of Spot VMs, reducing disruption to Symphony tasks. This is a massive boon to the observed efficiency of a compute cluster, because these disruptions are not so rare.
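Ranked machine-type pools are expressed through an instance flexibility policy on a regional MIG. The sketch below assumes a recent google Terraform provider that exposes `instance_flexibility_policy` (verify the exact schema against your provider's documentation); names and machine types are purely illustrative:

```hcl
resource "google_compute_region_instance_group_manager" "flex_workers" {
  name               = "symphony-flex-workers"   # hypothetical name
  region             = "europe-west1"
  base_instance_name = "symphony-worker"
  target_size        = 0

  version {
    instance_template = google_compute_instance_template.spot_worker.id
  }

  # Let GCP pick among ranked machine-type pools; with a Spot template, it
  # favours capacity with lower predicted preemption rates.
  instance_flexibility_policy {
    instance_selections {
      name          = "preferred"
      rank          = 1
      machine_types = ["n2-standard-16", "n2d-standard-16"]
    }
    instance_selections {
      name          = "fallback"
      rank          = 2
      machine_types = ["e2-standard-16"]
    }
  }
}
```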

This built-in optimization has notable implications for integration with IBM Spectrum Symphony's Host Factory, the component traditionally responsible for interpreting Symphony's resource demands and provisioning appropriate cloud instances. If you've read our article on cloud bursting, this further underlines our conclusion that said component can be bypassed: with flexible MIGs, not only do we get a more natural mapping between cluster demand and resize events, but we can also delegate instance selection to GCP, leading to a more efficient cluster at larger scale. However, pursuing this path represents a significant architectural consideration. It remains exploratory, so thorough investigation, development effort for the custom requestor, and rigorous testing would be required.

It is also essential to acknowledge the current limitations of Instance Flexibility:

  • GPU Support: Based on current GCP documentation, Instance Flexibility does not support MIGs containing GPU instances. For workloads requiring GPUs, the strategy of using separate, standard MIGs for each specific GPU instance template remains necessary.
  • Autoscaling: Instance Flexibility MIGs do not support the standard MIG autoscaling features. As previously discussed, however, these autoscaling mechanisms were not suitable for us, considering that it's the Host Factory that drives scaling events, so this limitation does not impact our specific use case.

Operational Considerations

MIGs are a very light abstraction provided by Google Cloud Platform, and they come at no additional cost: you only pay for the resources you use in or with the MIG. We created our instance templates, associated them with MIGs, and, by setting the target size to 0 in our Terraform configuration, can scale on demand without any unnecessary expenses.
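Because resize events are driven from outside Terraform at burst time, a common pattern (assumed here, not required) is to declare the idle size and tell Terraform to ignore subsequent drift, so a routine apply never scales a bursting group back down:

```hcl
resource "google_compute_instance_group_manager" "workers" {
  name               = "symphony-workers"   # hypothetical name
  zone               = "europe-west1-b"
  base_instance_name = "symphony-worker"

  version {
    instance_template = google_compute_instance_template.worker.id
  }

  # Idle by default: an empty MIG costs nothing beyond the resources it runs.
  target_size = 0

  lifecycle {
    # Resizes performed outside Terraform (e.g. by the burst requestor)
    # should not be reverted on the next apply.
    ignore_changes = [target_size]
  }
}
```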

MIGs integrate seamlessly with Google Cloud Monitoring, providing a single pane of glass for observing instance groups, alongside MIG-specific insights such as CPU and memory usage trends and predictions. Monitoring creation and failure events is simplified by tracking a single resize operation and its corresponding logs, offering a clear, consolidated view of resource status and operational health.

The ability to bulk create, resize, and destroy instances through simple, centralized resize commands that are easy to track in logs makes MIGs an efficient choice for managing large-scale, dynamic environments. This streamlined approach reduces the operational overhead typically associated with managing individual VMs or bulk operations.