Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities

authors

  • Tarraf, Ahmad
  • Schreiber, Martin
  • CASCAJO GARCIA, ALBERTO
  • Besnard, Jean-Baptiste
  • Vef, Marc-André
  • Huber, Dominik
  • Happ, Sonja
  • Brinkmann, Andre
  • EXPOSITO SINGH, DAVID
  • Hoppe, Hans-Christian
  • Miranda, Alberto
  • Peña, Antonio
  • Machado, Rui
  • Garcia-Gasulla, Marta
  • Schultz, Martin
  • Carpenter, Paul M.
  • Pickartz, Simon
  • Rotaru, Tiberiu
  • Iserte, Sergio
  • Lopez, Victor
  • Ejarque, Jorge
  • Sirwani, Heena
  • CARRETERO PEREZ, JESUS
  • Wolf, Felix

publication date

  • September 2024

start page

  • 1551

end page

  • 1564

volume

  • 35

International Standard Serial Number (ISSN)

  • 1045-9219

Electronic International Standard Serial Number (EISSN)

  • 1558-2183

abstract

  • With the increase of complex scientific simulations driven by workflows and heterogeneous workload profiles, managing system resources effectively is essential for improving performance and system throughput, especially due to trends like heterogeneous HPC and deeply integrated systems with on-chip accelerators. For optimal resource utilization, dynamic resource allocation can improve productivity across all system and application levels, by adapting the applications' configurations to the system's resources. In this context, malleable jobs, which can change resources at runtime, can increase the system throughput and resource utilization while bringing various advantages for HPC users (e.g., shorter waiting time). Malleability has received much attention recently, even though it has been an active research area for more than two decades. This article presents the state-of-the-art of malleable implementations in HPC systems, targeting mainly malleability in compute and I/O resources. Based on our experiences, we state our current concerns and list future opportunities for research.

subjects

  • Computer Science

keywords

  • resource management; runtime; monitoring; dynamic scheduling; throughput; terminology; systems support; malleability; state of the art; survey; hpc; malleable; hpc systems; resource allocation; system resources; system throughput; dynamic resource; dynamic allocation; dynamic resource allocation; service quality; parallelization; management process; development of applications; data transfer; programming model; fault tolerant; dynamic loading; resource usage; file system; job scheduling; resource assignment; load balancing; runtime environment; checkpointing; infrastructure monitoring; changes in workload; caching; storage resources; resource management system