Optimizations of polyhedral reductions and their use in algorithm-based fault tolerance

Narmour, Louis, author; Rajopadhye, Sanjay, advisor; Pouchet, Louis-Noël, committee member; Prabhu, Vinayak, committee member; Pezeshki, Ali, committee member

Files

Narmour_colostate_0053A_18789.pdf (4.9 MB)

Date

2025

Authors

Narmour, Louis, author

Rajopadhye, Sanjay, advisor

Pouchet, Louis-Noël, committee member

Prabhu, Vinayak, committee member

Pezeshki, Ali, committee member

Abstract

In this dissertation, we study the optimization of programs containing reductions and motivate a deeper connection between two ostensibly unrelated problems: one involving techniques for algorithmic improvement and another in the domain of Algorithm-Based Fault Tolerance (ABFT). Reductions combine collections of inputs with an associative and often commutative operator to produce collections of outputs. Such operations are interesting because they often require special handling to obtain good performance. When the same value contributes to multiple outputs, there is an opportunity to reuse partial results. Prior work showed how to exploit this and obtain a reduction (pun intended) in the program's asymptotic complexity through a program transformation called simplification. We propose extensions to prior work on simplification, provide the first complete push-button implementation of reduction simplification in a compiler, and show how to handle a strictly more general class of programs than previously supported. We evaluate its effectiveness and show that simplification rediscovers several key results in algorithmic improvement across multiple domains, previously obtained only through clever manual analysis and effort.
Additionally, we study the automation of fault tolerance against transient errors, such as those caused by cosmic radiation or by hardware component aging and degradation, using ABFT. ABFT methods typically work by adding redundant computation in the form of invariant checksums (i.e., reductions), which, by definition, should not change as the program executes. By computing and monitoring checksums, it is possible to detect errors by observing differences in the checksum values. However, this is challenging for two key reasons: (1) it requires careful manual analysis of the input program, and (2) the checksum computations must subsequently be carried out efficiently enough to justify their overhead. We propose automation techniques for a class of scientific codes called stencil computations and give methods to carry out this analysis at compile time. This is the first work to propose such an analysis in a compiler.
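
The two ideas in the abstract can be illustrated with a small sketch. It is not taken from the dissertation: the function names, the prefix-sum example, and the 3-point periodic stencil are assumptions chosen only for illustration. The first pair of functions shows how reusing a partial result lowers the cost of a reduction from quadratic to linear; the second pair shows a checksum invariant of the kind ABFT relies on, here the fact that one sweep of a periodic stencil scales the grid sum by the sum of the weights.

```python
def prefix_sums_naive(xs):
    # Y[i] = sum of xs[0..i]; every output is recomputed from scratch: O(N^2) work.
    return [sum(xs[: i + 1]) for i in range(len(xs))]


def prefix_sums_simplified(xs):
    # Reuse the previous output: Y[i] = Y[i-1] + xs[i], so the work is O(N).
    out = []
    running = 0
    for x in xs:
        running += x
        out.append(running)
    return out


def stencil_step(grid, weights):
    # One sweep of a 3-point stencil with periodic (wrap-around) boundaries.
    # The boundary condition is an assumption made to keep the invariant simple.
    left, center, right = weights
    n = len(grid)
    return [
        left * grid[(i - 1) % n] + center * grid[i] + right * grid[(i + 1) % n]
        for i in range(n)
    ]


def checksum_ok(before, after, weights, tol=1e-9):
    # Invariant: with periodic boundaries, sum(after) == sum(weights) * sum(before).
    # A mismatch beyond the tolerance signals a corrupted value.
    expected = sum(weights) * sum(before)
    return abs(sum(after) - expected) <= tol * max(1.0, abs(expected))
```

In this sketch the checksum costs one extra pass over the grid per sweep, which hints at the efficiency concern raised in point (2) of the abstract.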

Description

Includes summary in French; Chapter 1. Résumé en français.

Subject

fault tolerance

algorithmic improvement

polyhedral compilation

URI

https://hdl.handle.net/10217/241013
https://doi.org/10.25675/3.05100

Collections

2020-
Theses and Dissertations
