Malleability and fault tolerance in ad-hoc parallel file systems
Articles
Overview
published in
publication date
- September 2025
start page
- 1
end page
- 19
volume
- 28
Digital Object Identifier (DOI)
International Standard Serial Number (ISSN)
- 1386-7857
Electronic International Standard Serial Number (EISSN)
- 1573-7543
abstract
- In the last few years, there has been a significant rise in I/O demands across different applications, especially those related to Big Data and Artificial Intelligence. This has created a need for improved I/O capabilities in order to prevent potential data access bottlenecks. The Expand Ad-Hoc parallel file system is being designed and developed to address this. Because I/O workloads can vary throughout an application's execution, the file system must be malleable enough to adapt accordingly and avoid wasting resources. Additionally, as these applications have long execution times, fault tolerance mechanisms within the file system are essential to ensure their continued operation in the presence of failures. The primary innovation of this work lies in the design of the support for malleability, taking into account the data replication used for fault tolerance. This approach results in a system that can both expand and shrink, made possible by the integration of malleability and fault tolerance. This work therefore presents a design for the Expand Ad-Hoc parallel file system and evaluates it on the HPC4AI Laboratory supercomputer in Turin. We have evaluated this design, including its fault tolerance support, using benchmarks and real applications, comparing the results with the native file system used on the HPC4AI supercomputer. For the real application used, improvements of up to 100% have been obtained over the backend file system. For malleability, the time required to execute a malleability operation has been analyzed, and the results show that this operation can be performed with very small overhead. The results also show that malleability scales effectively with the resources and that, despite the data replication required for fault tolerance, Expand outperforms other parallel file systems without fault tolerance.
Classification
subjects
- Computer Science
keywords
- malleability; fault tolerance; expand ad-hoc; parallel file system; ad-hoc file system