Reliability Evaluation of LU Decomposition on GPU-Accelerated System-on-Chip Under Proton Irradiation

Graphics processing units (GPUs) have become a standard accelerator in both high-performance computing nodes and low-power systems-on-chip (SoCs). They provide massive data parallelism and very high performance per watt. However, their reliability in harsh environments is an important concern, especially for safety-critical applications. In this article, we evaluate how the parallelization strategy influences the reliability of lower-upper (LU) decomposition on a GPU-accelerated SoC under proton irradiation. Specifically, we compare a memory-bound and a compute-bound implementation of the decomposition on the K20A GPU embedded in a Tegra K1 (TK1) SoC. We tune the GPU and CPU clock frequencies for two purposes: to highlight the radiation sensitivity of the GPU that runs the benchmark, and to let both algorithms solve problems of the same size while exposed to the same radiation dose. The results show that more intensive use of GPU resources increases the cross section. We also observed that most radiation-induced errors hang the operating system and even the reboot process. Finally, we present a preliminary study of error propagation in the two LU decomposition algorithms.
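To make the quantities in the abstract more concrete, the following is a minimal Python sketch. It is not the authors' GPU code and does not reproduce either the memory-bound or the compute-bound implementation. It assumes a plain Doolittle LU decomposition without pivoting, since the abstract does not name the variant. It also defines a hypothetical `count_mismatches` helper that compares a run's output against a fault-free "golden" copy, which is a common way to detect silent data corruption in beam tests. Finally, it uses the usual radiation-testing definition of cross section: observed errors divided by particle fluence, giving an area in cm^2.

```python
"""Illustrative sketch only: all names and the tolerance are hypothetical."""


def lu_decompose(a):
    """Doolittle LU decomposition without pivoting, A = L * U.

    L has a unit diagonal. Assumes every leading principal minor of A is nonzero.
    """
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Row i of U.
        for k in range(i, n):
            upper[i][k] = a[i][k] - sum(lower[i][j] * upper[j][k] for j in range(i))
        # Column i of L, below the unit diagonal.
        lower[i][i] = 1.0
        for k in range(i + 1, n):
            partial = sum(lower[k][j] * upper[j][i] for j in range(i))
            lower[k][i] = (a[k][i] - partial) / upper[i][i]
    return lower, upper


def count_mismatches(result, golden, tol=1e-9):
    """Count output elements that differ from the golden (fault-free) output."""
    return sum(
        1
        for row_r, row_g in zip(result, golden)
        for r, g in zip(row_r, row_g)
        if abs(r - g) > tol
    )


def cross_section(num_errors, fluence):
    """Cross section in cm^2: observed errors per unit fluence (particles/cm^2)."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return num_errors / fluence


if __name__ == "__main__":
    lower, upper = lu_decompose([[4.0, 3.0], [6.0, 3.0]])
    print(lower)  # [[1.0, 0.0], [1.5, 1.0]]
    print(upper)  # [[4.0, 3.0], [0.0, -1.5]]
```

In an actual campaign, the comparison would run on the device output after each execution. The per-run error counts would then be accumulated against the beam fluence delivered during the exposure.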

fault tolerance; graphics processing unit (GPU); lower-upper (LU) decomposition; proton irradiation; system-on-chip (SoC)
