This article summarizes the performance evaluation of Simcenter StarCCM+, SeWaS, and Chameleon on Google Cloud Platform (GCP), specifically focusing on the C2D and H4D machine types. Our aim is to provide insights for optimal infrastructure selection when running high-performance computing tasks such as computational fluid dynamics simulations and seismic wave propagation.
My tests were conducted in the europe-west4 region, specifically in zone europe-west4-b.
The H4D machines used the h4d-highmem-192-lssd configuration, featuring 192 vCPUs and 1488 GB of memory (2 sockets x 96 cores per socket and 1 thread per core, with SMT disabled). The operating system was Rocky Linux 8. For the H4D machines, I configured the network to leverage the Cloud RDMA driver for optimal performance.
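As a quick sanity check of this topology, the short sketch below prints the number of online vCPUs and the kernel-reported SMT state on a node. It is a minimal example assuming a Linux guest (such as Rocky Linux 8) that exposes the standard sysfs control file, not part of the benchmark tooling.

```python
# Minimal topology sanity check (assumes a Linux guest exposing sysfs).
import os
from pathlib import Path

def smt_state() -> str:
    """Return the kernel-reported SMT state ('on', 'off', 'notsupported', ...)."""
    ctrl = Path("/sys/devices/system/cpu/smt/control")
    return ctrl.read_text().strip() if ctrl.exists() else "unknown"

if __name__ == "__main__":
    print(f"online vCPUs: {os.cpu_count()}")  # expected 192 on h4d-highmem-192-lssd
    print(f"SMT state:    {smt_state()}")     # expected 'off' (1 thread per core)
```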
The rest of the article is divided into three sections summarizing our findings for each of the applications.
Simcenter StarCCM+ is a comprehensive multiphysics CFD software suite. For these benchmarks, I used Simcenter STAR-CCM+ 2310 Build 18.06.006 (linux-x86_64-2.17/gnu11.2). I employed Ramble, a GCP utility designed for reproducible benchmarks, which automates software installation, input file acquisition, experiment configuration, and result extraction within a SLURM cluster environment.
For StarCCM+, we tested both the H4D and C2D VM types at several cluster sizes, from 1 to 7 H4D VMs and from 4 to 12 C2D VMs.
Our test case was the public dataset LeMans-104m, part of the library tests bundle. I focused on the following performance metrics: average elapsed time, cell-iterations per worker-second (throughput), and resident high-water-mark (HWM) memory.
The following table summarizes the performance indicators for the LeMans-104m test case:
| VM Type | VMs | Workers | Avg Elapsed Time (s) | Cell-Iterations per Worker-Second | Resident HWM Mem (GB) |
|---|---|---|---|---|---|
| H4D | 1 | 92 | 4.452 | 122082 | 365.036 |
| H4D | 3 | 84 | 2.178 | 124113 | 835.341 |
| H4D | 4 | 76 | 1.082 | 122974 | 1329.433 |
| H4D | 7 | 134 | 0.564 | 132622 | 2238.031 |
| C2D | 4 | 224 | 3.766 | 119136 | 283.470 |
| C2D | 8 | 448 | 1.829 | 117148 | 419.152 |
| C2D | 12 | 672 | 1.197 | 114705 | 552.646 |
Note: For these runs, 7 H4D VMs was the maximum VM count we could allocate during the early technical preview.
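For readers less familiar with the throughput metric, the sketch below shows how cell-iterations per worker-second is typically derived from the cell count, the number of workers, and the elapsed time per iteration. Both the formula and the round numbers are illustrative assumptions, not values taken from the table above.

```python
def cell_iterations_per_worker_second(n_cells: int, n_workers: int,
                                      elapsed_s_per_iteration: float) -> float:
    """Cells advanced per worker per second of elapsed time, per iteration."""
    return n_cells / (n_workers * elapsed_s_per_iteration)

# Purely illustrative (hypothetical) inputs:
print(cell_iterations_per_worker_second(n_cells=104_000_000,
                                        n_workers=1_000,
                                        elapsed_s_per_iteration=1.0))  # 104000.0
```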
The H4D runs use significantly more memory, peaking at 2238.031 GB (H4D, 7 VMs), versus a peak of only 552.646 GB for C2D (12 VMs), roughly 4.05x (about 305%) more. However, it is important to note that it is not merely the VM type that is responsible for this higher usage: StarCCM+ itself consumes more memory when running on these machines.
Additionally, even this high-water mark (HWM) remains low compared to the total memory available across the H4D VMs, which may itself explain why StarCCM+ allocates more memory during its operations on these machines.
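The ratio quoted above can be reproduced directly from the two peak values in the table; the snippet below is just that arithmetic, nothing more.

```python
# Peak resident HWM memory from the table above (GB).
h4d_peak_gb = 2238.031  # H4D, 7 VMs
c2d_peak_gb = 552.646   # C2D, 12 VMs

ratio = h4d_peak_gb / c2d_peak_gb
print(f"H4D peak is {ratio:.2f}x the C2D peak "
      f"({(ratio - 1) * 100:.1f}% higher)")  # ~4.05x, ~305% higher
```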
The H4D VM type demonstrates superior performance (higher throughput and lower average elapsed time) compared to the C2D VM type, especially when scaled up.
However, the C2D VM type is far more memory-efficient for this test, using roughly a quarter of the memory of the H4D VM type at their respective largest configurations (552.646 GB for C2D vs 2238.031 GB for H4D).
Hence, based on this benchmark, our conclusion is that the choice between H4D and C2D depends on the priority: maximum throughput and shortest time-to-solution favor H4D, while memory efficiency favors C2D.
SeWaS is a seismic wave propagation simulator tailored for massively parallel hardware infrastructures. The application dataflow is built on top of PaRSEC, a generic task-based runtime system. In this section, we compare the performance of H4D and C2D VM types running SeWaS on a single node, focusing on execution time and parallelism. The application was built using the CMake installation procedure described in the official documentation, with Intel MPI version 2021.13.1 instead of OpenMPI for both H4D and C2D runs.
The following table summarizes the performance indicators for this test:
| Metric | H4D Value (s) | C2D Value (s) |
|---|---|---|
| Global Elapsed Time | 226.30 | 615.72 |
| Global CPU Time | 38265.44 | 33131.05 |
| Core Simulation Elapsed Time | 225.56 | 609.72 |
| ComputeVelocity Elapsed Time | 11724.35 | 10433.13 |
| ComputeStress Elapsed Time | 20964.25 | 17939.95 |
Our interpretation is that the individual core efficiency of C2D may be slightly superior for these specific functions, since it requires less total CPU time. However, H4D's far greater parallel execution capability outweighs this, resulting in a substantial reduction in wall-clock time.
The H4D VM configuration provides a significant advantage in overall wall-clock time due to its high level of parallelism. This performance difference is maintained despite C2D showing lower total accumulated CPU time for the individual core functions, confirming that H4D's capacity for high concurrency is the primary factor affecting the final elapsed time.
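One way to make the parallelism argument concrete is to divide the accumulated CPU time by the wall-clock time, which gives a rough "effective concurrency" for each run; the sketch below applies this to the table values. Treating the ratio as effective concurrency is our own reading of the numbers, not a metric reported by SeWaS.

```python
# Global elapsed and CPU times from the table above (seconds).
runs = {
    "H4D": {"elapsed": 226.30, "cpu": 38265.44},
    "C2D": {"elapsed": 615.72, "cpu": 33131.05},
}

for name, r in runs.items():
    # Rough effective concurrency: accumulated CPU seconds per wall-clock second.
    print(f"{name}: ~{r['cpu'] / r['elapsed']:.0f} busy cores on average")  # ~169 vs ~54

speedup = runs["C2D"]["elapsed"] / runs["H4D"]["elapsed"]
print(f"H4D wall-clock speedup over C2D: {speedup:.2f}x")  # ~2.72x
```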
CHAMELEON is a C-based framework for solving dense linear algebra problems, including systems of linear equations and linear least squares, using factorizations like LU, Cholesky, QR, and LQ. It supports both real and complex arithmetic in single and double precision. The application was set up by following the Spack installation instructions, with the same version of Intel MPI as the previous two benchmarks. I tested the CHAMELEON DGEMM (Double-precision General Matrix Matrix Multiply) function performance on H4D VMs across varying workload sizes and node counts.
Comparing the 1000³ matrix run to the 10000³ matrix run on a single H4D node shows a significant efficiency gain as the workload size increases.
| Configuration | Matrix Size (m, n, k) | Time (s) | GFLOPS (Performance) |
|---|---|---|---|
| 1 Node H4D | 1000³ | 0.00579 | 345.24 |
| 1 Node H4D | 10000³ | 0.4397 | 4548.37 |
Performance Increase: By increasing the matrix size from 1000³ to 10000³ (a 1000x increase in total operations), the GFLOPS figure increases from 345.24 to 4548.37. This represents a performance gain of approximately 1217% (or 13.17x), demonstrating that the H4D node is utilized far more efficiently by larger, data-intensive tasks.
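These GFLOPS figures are consistent with the conventional DGEMM operation count of 2·m·n·k floating-point operations; the sketch below simply recomputes them from the times in the table. The 2·m·n·k convention is an assumption on our part, since Chameleon's exact reporting formula is not shown here.

```python
def dgemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """GFLOPS assuming the conventional 2*m*n*k flop count for a matrix-matrix multiply."""
    return 2.0 * m * n * k / seconds / 1e9

small = dgemm_gflops(1_000, 1_000, 1_000, 0.00579)    # ~345 GFLOPS
large = dgemm_gflops(10_000, 10_000, 10_000, 0.4397)  # ~4549 GFLOPS
print(small, large, f"gain: {large / small:.2f}x")    # gain ~13.17x
```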
This comparison examines the effect of moving from one node to two nodes while keeping the total workload (10000³ matrix) constant.
| Configuration | Matrix Size (m, n, k) | Time (s) | GFLOPS (Performance) |
|---|---|---|---|
| 1 Node H4D | 10000³ | 0.4397 | 4548.37 |
| 2 Nodes H4D | 10000³ | 0.3103 | 6444.60 |
Time Reduction (Speedup): Increasing the node count from 1 to 2 reduces the execution time from 0.4397 s to 0.3103 s. This constitutes a time reduction of approximately 29.4%, corresponding to a speedup factor of approximately 1.42x.
Performance Scaling: The GFLOPS figure increases from 4548.37 to 6444.60, an increase of approximately 41.7%. This shows that adding a second node provides significant, though sub-linear, scaling in performance.
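The same two timings also give the speedup and parallel efficiency directly; the snippet below is just that arithmetic, using the usual convention that efficiency is speedup divided by the resource ratio.

```python
# 10000^3 DGEMM timings from the table (seconds).
t_1node, t_2nodes = 0.4397, 0.3103
speedup = t_1node / t_2nodes     # ~1.42x
efficiency = speedup / 2         # 2x the nodes -> ~71%
print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
```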
The final test examines a massive workload across 1 node versus 2 nodes.
| Configuration | Matrix Size (m, n, k) | Time (s) | GFLOPS (Performance) |
|---|---|---|---|
| 1 Node H4D | 100000³ | 278.46 | 7182.44 |
| 2 Nodes H4D | 100000³ | 131.71 | 15184.33 |
Time Reduction (Speedup): Distributing this extreme workload across 2 nodes reduces the time from 278.46 s to 131.71 s, achieving a speedup factor of 2.11x.
Efficiency: The observed speedup (2.11x) exceeds the increase in resources (2x nodes), which is typically categorized as super-linear scaling and indicates very high utilization of a multi-node H4D cluster for the most intense computational tasks (the short sketch after these points reproduces the arithmetic).
Maximum Performance: The 2-node system achieves its highest recorded performance, 15184.33 GFLOPS (or 15.18 TFLOPS), confirming that distributing the largest, most computationally intense tasks across multiple H4D nodes maximizes the cluster's utilization and raw performance capability.
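Applying the same arithmetic to the 100000³ timings shows the parallel efficiency exceeding 100%, which is what makes this run super-linear. Again, this is only a sketch of the calculation, not output produced by Chameleon.

```python
# 100000^3 DGEMM timings from the table (seconds).
t_1node, t_2nodes = 278.46, 131.71
speedup = t_1node / t_2nodes      # ~2.11x
efficiency = speedup / 2          # 2x the nodes -> ~106%
assert efficiency > 1.0           # i.e. super-linear scaling
print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
```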
In summary, the data confirms that Chameleon's DGEMM routine benefits significantly from both increased workload size on a single node and effective parallel distribution across multiple H4D nodes for the most demanding problems.
The performance evaluation of Simcenter StarCCM+, SeWaS, and Chameleon on Google Cloud Platform highlights significant advantages of the H4D VM type over C2D, particularly for high-performance computing tasks like computational fluid dynamics simulations and seismic wave propagation: H4D delivered higher throughput and lower elapsed times for StarCCM+, roughly 2.7x faster wall-clock time for SeWaS, and strong multi-node scaling for Chameleon's DGEMM, while C2D remained the more memory-efficient option for StarCCM+.
Choosing between H4D and C2D VMs ultimately depends on specific use cases and requirements. For applications demanding maximum computational performance and speed, H4D is the clear winner, making it ideal for high-end simulations and analyses. Conversely, for scenarios where memory efficiency is paramount, C2D VMs remain a competent choice, albeit at the expense of lower throughput and longer execution times.
Overall, these benchmarks provide critical insights for organizations looking to leverage cloud-based HPC solutions, enabling informed decisions tailored to their computational needs. Given the performance improvements observed, I encourage you to benchmark your own workloads and see the gains for yourself.