LIMITLESS - LIght-weight MonItoring Tool for LargE Scale Systems

This work presents LIMITLESS, an HPC framework that provides new strategies for monitoring clusters. LIMITLESS is a scalable, light-weight monitor that is integrated with other HPC runtimes to obtain a holistic view of the system that combines platform and application monitoring. This paper describes the novel components of the architecture, including new approaches for achieving higher scalability based on a combination of in-transit processing and performance prediction. We also include a methodology for improving application scheduling by means of machine learning classifiers and application profiling. A practical evaluation on simulated and real platforms shows significant monitoring scalability and data-retrieval capacity, together with reduced overheads. Results show that the performance prediction techniques reduce communications and the number of monitoring packets by more than 90% on average, and that the fine-grain scheduling allows LIMITLESS to run applications on shared nodes, reducing the makespan by 25% and saving resources.
