diff --git a/CMakeLists.txt b/CMakeLists.txt index 5aa6ec3..ecaac8b 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -16,9 +16,9 @@ set(FENIX_VERSION_MAJOR 1) set(FENIX_VERSION_MINOR 0) option(BUILD_EXAMPLES "Builds example programs from the examples directory" OFF) -option(BUILD_TESTING "Builds tests and test modes of files" ON) -option(BUILD_DOCS "Builds documentation if is doxygen found" ON) -option(DOCS_ONLY "Only build documentation" OFF) +option(BUILD_TESTING "Builds tests and test modes of files" ON) +option(BUILD_DOCS "Builds documentation if is doxygen found" ON) +option(DOCS_ONLY "Only build documentation" OFF) #Solves an issue with some system environments putting their MPI headers before #the headers CMake includes. Forces non-system MPI headers when incorrect headers diff --git a/doc/CMakeLists.txt b/doc/CMakeLists.txt index f5c0350..90ba9cf 100644 --- a/doc/CMakeLists.txt +++ b/doc/CMakeLists.txt @@ -1,8 +1,8 @@ find_package(Doxygen) -set(FENIX_DOCS_OUTPUT ${CMAKE_CURRENT_BINARY_DIR} CACHE INTERNAL "Documentation output directory") -set(FENIX_DOCS_MAN "YES" CACHE INTERNAL "Option to disable man page generation for CI builds") -set(FENIX_BRANCH "local" CACHE INTERNAL "Git branch being documented, or local if not building for Github Pages") +set(FENIX_DOCS_OUTPUT ${CMAKE_CURRENT_BINARY_DIR} CACHE PATH "Documentation output directory") +set(FENIX_DOCS_MAN "YES" CACHE BOOL "Option to disable man page generation for CI builds") +set(FENIX_BRANCH "local" CACHE BOOL "Git branch being documented, or local if not building for Github Pages") if(NOT DOXYGEN_FOUND) message(STATUS "Doxygen not found, `make docs` disabled") diff --git a/doc/markdown/DataRecovery.md b/doc/markdown/DataRecovery.md index e2f8ddf..f5e8d3b 100644 --- a/doc/markdown/DataRecovery.md +++ b/doc/markdown/DataRecovery.md @@ -35,9 +35,10 @@ redundant storage, can be associated with a specific instance of a Fenix *data group* to form a semantic unit. Committing such a group ensures that the data involved is available for recovery. +----- + ## Data Groups ------ A Fenix *data group* provides dual functionality. First, it serves as a container for a set of data objects (*members*) that are committed together, and hence provides transaction semantics. @@ -55,9 +56,10 @@ conditionally skip the creation after initialization). See #Fenix_Data_group_create for creating a data group. +----- + ## Data Redundancy Policies ------ Fenix internally uses an extensible system for defining data policies to keep the door open to easily adding new data policies and configuring them on a per-data-group basis. We currently diff --git a/doc/markdown/ProcessRecovery.md b/doc/markdown/ProcessRecovery.md index e2d8e1f..56f338a 100644 --- a/doc/markdown/ProcessRecovery.md +++ b/doc/markdown/ProcessRecovery.md @@ -1,19 +1,116 @@ -Functions and types for process recovery. - -* Only communicators derived from the communicator returned by -Fenix_Init are eligible for reconstruction. -After communicators have been repaired, they contain the same -number of ranks as before the failure occurred, unless the user -did not allocate sufficient redundant resources (*spare ranks*) -and instructed Fenix not to create new ranks. In this case -communicators will still be repaired, but will contain fewer -ranks than before the failure occurred. - -* To ease adoption of MPI fault tolerance, Fenix automatically -captures any errors resulting from MPI library calls that are a -result of a damaged communicator (other errors reported by the -MPI runtime are ignored by Fenix and are returned to the -application, for handling by the application writer). In other -words, programmers do not need to replace calls to the MPI library -with calls to Fenix (for example, *Fenix_Send* instead of -*MPI_Send*). +Process recovery within Fenix can be broken down into three steps: detection, +communicator recovery, and application recovery. + +--- + +## Detecting Failures + +Fenix is built on top of ULFM MPI, so specific fault detection mechanisms and +options can be found in the [ULFM +documentation](https://docs.open-mpi.org/en/v5.0.x/features/ulfm.html#). At a +high level, this means that Fenix will detect failures when an MPI function +call is made which involves a failed rank. Detection is not collectively +consistent, meaning some ranks may fail to complete a collective while other +ranks finish successfully. Once a failure is detected, Fenix will 'revoke' the +communicator that the failed operation was using and the top-level communicator +output by #Fenix_Init (these communicators are usually the same). The +revocation is permanent, and means that all future operations on the +communicator by any rank will fail. This allows knowledge of the failed rank to +be propagated to all ranks in the communicator, even if some ranks would never +have directly communicated with the failed rank. + +Since failures can only be detected during MPI function calls, applications with +long periods of communication-free computation will experience delays in beginning +recovery. Such applications may benefit from inserting periodic calls to +#Fenix_Process_detect_failures to allow ranks to participate in global recovery +operations with less delay. + +Fenix will only detect and respond to failures that occur on the communicator +provided by #Fenix_Init or any communicators derived from it. Faults on other +communicators will, by default, abort the application. Note that having +multiple derived communicators is not currently recommended, and may lead to +deadlock. In fact, even one derived communicator may lead to deadlock if not +used carefully. If you have a use case that requires multiple communicators, +please contact us about your use case -- we can provide guidance and may be +able to update Fenix to support it. + +**Advanced:** Applications may wish to handle some failures themselves - either +ignoring them or implementing custom recovery logic in certain code regions. +This is not generally recommended. Significant care must be taken to ensure +that the application does not attempt to enter two incompatible recovery steps. +However, if you wish to do this, you can include "fenix_ext.h" and manually set +`fenix.ignore_errs` to a non-zero value. This will cause Fenix's error handler +to simply return any errors it encounters as the exit code of the application's +MPI function call. Alternatively, applications may temporarily replace the +communicator's error handler to avoid Fenix recovery. If you have a use case +that would benefit from this, you can contact us for guidance and/or to request +some specific error handling features. + +--- + +## Communicator Recovery + +Once a failure has been detected, Fenix will begin the collective process of +rebuilding the resilient communicator provided by #Fenix_Init. There are two +ways to rebuild: replacing failed ranks with spares, or shrinking the +communicator to exclude the failed ranks. If there are any spares available, +Fenix will use those to replace the failed ranks and maintain the original +communicator size and guarantee that surviving processes keep the same rank ID. +If there are not enough spares, some processes may have a different rank ID on +the new communicator, and Fenix will warn the user about this by setting the +error code for #Fenix_Init to #FENIX_WARNING_SPARE_RANKS_DEPLETED. + +**Advanced:** Communicator recovery is collective, blocking, and not +interruptable. ULFM exposes some functions (e.g. MPIX_Comm_agree, +MPIX_Comm_shrink) that are also not interrupable -- meaning they will continue +despite any failures or revocations. If multiple collective, non-interruptable +operations are started by different ranks in different orders, the application +will deadlock. This is similar to what would happen if a non-resilient +application called multiple collectives (e.g. `MPI_Allreduce`) in different +orders. However, the preemptive and inconsistent nature of failure recovery +makes it more complex to reason about ordering between ranks. Fenix uses these +ULFM functions internally, so care is taken to ensure that the order of +operations is consistent across ranks. Before any such operation begins, Fenix +first uses MPIX_Comm_agree on the resilient communicator provided by +#Fenix_Init to agree on which 'location' will proceed - if there is any +disagreement, all ranks will enter recovery as if they had detected a failure. +Applications which wish to use these functions themselves should follow this +pattern, providing a unique 'location' value for any operations that may be +interrupted. + +--- + +## Application Recovery + +Once a new communicator has been constructed, application recovery begins. +There are two recovery modes: jumping (default) and non-jumping. With jumping +recovery, Fenix will automatically `longjmp` to the #Fenix_Init call site once +communicator recovery is complete. This allows for very simple recovery logic, +since it mimics the traditional teardown-restart pattern. However, `longjmp` +has many undefined semantics according to the C and C++ specifications and may +result in unexpected behavior due to compiler assumptions and optimizations. +Additionally, some applications may be able to more efficiently recover by +continuing inline. Users can initialize Fenix as non-jumping (see test/no_jump) +to instead return an error code from the triggering MPI function call after +communicator recovery. This may require more intrusive code changes (checking +return statuses of each MPI call). + +Fenix also allows applications to register one or more callback functions with +#Fenix_Callback_register and #Fenix_Callback_pop, which removes the most +recently registered callback. These callbacks are invoked after communicator +recovery, just before control returns to the application. Callbacks are +executed in the reverse order they were registered. + +For C++ applications, it is recommended to use Fenix in non-jumping mode and to +register a callback that throws an exception. At it's simplest, wrapping +everything between #Fenix_Init and #Fenix_Finalize in a single try-catch can +give the same simple recovery logic as jumping mode, but without the undefined +behavior of `longjmp`. + +#Fenix_Init outputs a role, from #Fenix_Rank_role, which helps inform the +application about the recovery state of the rank. It is important to note that +all spare ranks are captured inside #Fenix_Init until they are used for +recovery. Therefore, after recovery, recovered ranks will not have the same +callbacks registered -- recovered ranks will need to manually invoke any +callbacks that use MPI functions. These roles also help the application more +generally modify it's behavior based on each rank's recovery state. diff --git a/include/fenix.h b/include/fenix.h index ca310ea..0a1d783 100644 --- a/include/fenix.h +++ b/include/fenix.h @@ -75,6 +75,7 @@ extern "C" { /** * @defgroup ReturnCodes Return Codes * @brief All possible return codes from Fenix functions. + * Errors are negative, warnings are positive. * @{ */ #define FENIX_SUCCESS 0 @@ -115,6 +116,7 @@ extern "C" { /** * @defgroup ProcessRecovery Process Recovery + * @brief Functions for managing process recovery in Fenix. * @details @include{doc} ProcessRecovery.md * @{ */