Electronic International Standard Serial Number (EISSN)
1741-2846
abstract
The current static usage model of HPC systems is becoming increasingly inefficient. This is driven by the continuously growing complexity and heterogeneity of system architectures, in combination with the increased usage of coupled applications, the need for strong scaling with extreme-scale parallelism, and the increasing reliance on complex and dynamic workflows. As a result, research is growing on malleable systems, middleware, and applications, which can adjust resource usage dynamically to maximize efficiency. By providing intelligent global coordination of resource usage, through runtime scheduling of computation, network usage, and I/O across all components of the system architecture, malleable HPC systems can maximize the exploitation of their resources while minimizing the makespan of applications in many, if not most, cases. Of particular concern is the emerging class of data-intensive applications and their interaction with classic simulation workloads, driven by the growing need to process extremely large datasets. For these workloads, uncoordinated file access combined with limited bandwidth makes the I/O system a serious bottleneck. Emerging multi-tier storage hierarchies have the potential to remove this barrier, but maximizing performance still requires careful control to avoid congestion. Malleability allows systems to dynamically adjust the computation and storage needs of applications on the one hand and of the global system on the other. Such malleable systems, however, face a series of fundamental research challenges: Who initiates changes in resource availability or usage? How are these changes communicated? How can the optimal usage be computed? How can applications cope with dynamically changing resources? What should malleable programming models and abstractions look like? How should resource management frameworks for malleable systems be designed? Which resources benefit from malleability, and which (if any) should still be managed statically?