Malleability and fault tolerance in ad-hoc parallel file systems

publication date

  • September 2025

start page

  • 1

end page

  • 19

volume

  • 28

International Standard Serial Number (ISSN)

  • 1386-7857

Electronic International Standard Serial Number (EISSN)

  • 1573-7543

abstract

  • In the last few years, there has been a significant rise in I/O demands across different applications, especially those related to Big Data and Artificial Intelligence. This has created a need for improved I/O capabilities in order to prevent potential data access bottlenecks. The Expand Ad-Hoc parallel file system is being designed and developed to address this. Given that I/O workloads may vary throughout an application's execution, the file system must be malleable enough to adapt accordingly and avoid wasting resources. Additionally, as these applications have long execution times, fault tolerance mechanisms within the file system are essential to ensure their continued operation in the presence of failures. The primary innovation of this work lies in the design of the support for malleability, taking into account the data replication used for fault tolerance. This approach results in a system that exhibits both expansion and reduction capability, made possible by the integration of malleability and fault tolerance. Therefore, this work presents a design for the Expand Ad-Hoc parallel file system and evaluates it on the HPC4AI Laboratory supercomputer in Turin. We have evaluated this design, including its fault tolerance support, using benchmarks and real applications, comparing the results with the native file system used on the HPC4AI supercomputer. For the real application used, improvements of up to 100% have been obtained relative to the backend file system. For malleability, the time required to execute a malleability operation has been analyzed. In this case, the results show that it is possible to perform this malleability operation with very small overhead times.
The results show that the malleability scales effectively with the resources, and despite the data replication in the fault tolerance, we outperform other parallel file systems without fault tolerance.

subjects

  • Computer Science

keywords

  • malleability; fault tolerance; expand ad-hoc; parallel file system; ad-hoc file system