Optimizations of polyhedral reductions and their use in algorithm-based fault tolerance
Date
2025
Journal Title
Journal ISSN
Volume Title
Abstract
In this dissertation, we study the optimization of programs containing reductions and motivate a deeper connection between two ostensibly unrelated problems, one involving techniques for algorithmic improvement and another in the domain of Algorithm-Based Fault Tolerance. Reductions combine collections of inputs with an associative and often commutative operator to produce collections of outputs. Such operations are interesting because they often require special handling to obtain good performance. When the same value contributes to multiple outputs, there is an opportunity to reuse partial results, enabling reduction simplification. Prior work showed how to exploit this and obtain a reduction (pun intended) in the program's asymptotic complexity through a program transformation called simplification. We propose extensions to prior work on simplification and provide the first complete push-button implementation of reduction simplification in a compiler and show how to handle a strictly more general class of programs than previously supported. We evaluate its effectiveness and show that simplification rediscovers several key results in algorithmic improvement across multiple domains, previously only obtained through clever manual human analysis and effort. Additionally, we complement this and study the automation of generalized and automated fault tolerance against transient errors, such as those occurring due to cosmic radiation or hardware component aging and degradation, using Algorithm-Based Fault Tolerance (ABFT). ABFT methods typically work by adding some redundant computation in the form of invariant checksums (i.e., reductions), which, by definition, should not change as the program executes. By computing and monitoring checksums, it is possible to detect errors by observing differences in the checksum values. However, this is challenging for two key reasons: (1) it requires careful manual analysis of the input program, and (2) care must be taken to subsequently carry out the checksum computations efficiently enough for it to be worth it. We propose automation techniques for a class of scientific codes called stencil computations and give methods to carry out this analysis at compile time. This is the first work to propose such an analysis in a compiler.
Description
Includes summary in French; Chapter 1. Résumé en français.
Rights Access
Subject
fault tolerance
algorithmic improvement
polyhedral compilation