EDF R&D Next-gen elastic HPC Clusters on AWS
The context
EDF R&D runs a substantial number of numerical simulations and maintains a variety of on-premises High-Performance Computing (HPC) infrastructures, including Slurm clusters such as CRONOS and GAIA. GAIA was scheduled to be decommissioned and replaced by a next-generation HPC cluster, SELENA, by the end of 2025. In the meantime, EDF R&D faced an estimated gap of 100 million core-hours in computing capacity for the next year.
Aneo built an elastic HPC solution on AWS for EDF R&D. The solution provides additional computing resources for manual overflow on a project-by-project basis and addresses resilience challenges in case of internal outages. EDF R&D also aimed to give its researchers on-demand access to state-of-the-art computing resources, such as the latest GPUs, and to port EDF’s flagship CFD code so that it could exploit them.
Challenge
- Maintain a consistent user experience while allowing EDF to go beyond the traditional static HPC cluster setups.
- Provide flexible HPC deployments while complying with EDF’s strict security requirements for user access and data management.
- Support researchers as they discover what the new platform can do for them.
- Handle the technical complexity of adding NVIDIA GPU support to EDF’s flagship CFD code.
Impact
- The ability to deploy specialized, targeted, and possibly ephemeral HPC clusters on a project-by-project basis or according to other relevant criteria.
- The ability to expand EDF R&D's HPC capacity to absorb the load of the traditional busy periods in the middle and at the end of the year.
- A platform that can be operated by less than one FTE year-round without any impact on uptime or support quality.
- The ability to take advantage of on-demand access to GPUs and other state-of-the-art computing resources, and to run EDF R&D's most-used CFD code on them with significant performance gains.