Update to remove FIXMEs
edwardalee committed Jan 30, 2024
1 parent 6f86704 commit 39b70cb
Showing 3 changed files with 12 additions and 21 deletions.
18 changes: 1 addition & 17 deletions examples/C/src/leader-election/HeartbeatBully.lf
@@ -14,19 +14,7 @@
* is set so that each primary fails after sending three heartbeat messages. When all nodes have
* failed, then the program exits.
*
* This example is designed to be run as a federated program with decentralized coordination.
* However, as of this writing, bugs in the federated code generator cause the program to fail
* because all federates get the same bank_index == 0. This may be related to these bugs:
*
* - https://github.com/lf-lang/lingua-franca/issues/1961
* - https://github.com/lf-lang/lingua-franca/issues/1962
*
* When these bugs are fixed, then the federated version should operate exactly the same as the
* unfederated version except that it will become possible to kill the federates instead of having
* them fail on their own. The program should also be extended to include STP violation handlers to
* deal with the fundamental CAL theorem limitations, where unexpected network delays make it
* impossible to execute the program as designed. For example, if the network becomes partitioned,
* then it becomes possible to have two primary nodes simultaneously active.
* This example is designed to be run as a federated program.
*
* @author Edward A. Lee
* @author Marjan Sirjani
@@ -101,10 +89,6 @@ reactor Node(
}
}
}
// FIXME
// =} STP (0) {=
// FIXME: What should we do here.
// lf_print_error("Node %d had an STP violation. Ignoring heartbeat as if it didn't arrive at all.", self->bank_index);
=}

reaction(t) -> reset(Prospect) {=
15 changes: 11 additions & 4 deletions examples/C/src/leader-election/README.md
@@ -1,11 +1,14 @@
# Leader Election

These federated programs implements a redundant fault-tolerant system where a primary node, if and when it fails, is replaced by a backup node. The protocol is described in this paper:
These federated programs implement redundant fault-tolerant systems where a primary node, if and when it fails, is replaced by a backup node. The HeartbeatBully example is described in this paper:

> Bjarne Johansson; Mats Rågberger; Alessandro V. Papadopoulos; Thomas Nolte, "Consistency Before Availability: Network Reference Point based Failure Detection for Controller Redundancy," Emerging Technologies and Factory Automation (ETFA), 12-15 September 2023, [DOI:10.1109/ETFA54631.2023.10275664](https://doi.org/10.1109/ETFA54631.2023.10275664)
> B. Johansson, M. Rågberger, A. V. Papadopoulos and T. Nolte, "Heartbeat Bully: Failure Detection and Redundancy Role Selection for Network-Centric Controller," IECON 2020 The 46th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 2020, pp. 2126-2133, [DOI: 10.1109/IECON43393.2020.9254494](https://doi.org/10.1109/IECON43393.2020.9254494).
The NRP examples extend the algorithm to reduce the likelihood of getting multiple primaries when the network becomes partitioned. The NRP protocol is described in this paper:

The key idea in this protocol is that when a backup fails to detect the heartbeats of a primary node, it becomes primary only if it has access to Network Reference Point (NRP), which is a point in the network. This way, if the network becomes partitioned, only a backup that is on the side of the partition that still has access to the NRP can become a primary. If a primary loses access to the NRP, then it relinquishes its primary role because it is now on the wrong side of a network partition. A backup on the right side of the partition will take over. The "FD" in the names of the programs stands for "fault detection."
> B. Johansson, M. Rågberger, A. V. Papadopoulos, and T. Nolte, "Consistency Before Availability: Network Reference Point based Failure Detection for Controller Redundancy," Emerging Technologies and Factory Automation (ETFA), 12-15 September 2023, [DOI:10.1109/ETFA54631.2023.10275664](https://doi.org/10.1109/ETFA54631.2023.10275664)
The key idea in the NRP protocol is that when a backup fails to detect the heartbeats of a primary node, it becomes primary only if it has access to a Network Reference Point (NRP), which is a point in the network. This way, if the network becomes partitioned, only a backup that is on the side of the partition that still has access to the NRP can become a primary. If a primary loses access to the NRP, then it relinquishes its primary role because it is now on the wrong side of a network partition. A backup on the right side of the partition will take over. The "FD" in the names of the programs stands for "fault detection."
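
The role rule described above can be sketched as a small decision function. This is only an illustrative sketch in plain C, not code taken from `NRP_FD.lf`: the names `node_role_t`, `next_role`, `heartbeat_seen`, and `nrp_reachable` are hypothetical, and the real programs express this logic with reactors, modes, and timers rather than a single function.

```c
#include <stdbool.h>

/* Hypothetical role type; the actual examples use Lingua Franca modes. */
typedef enum { ROLE_BACKUP, ROLE_PRIMARY } node_role_t;

/*
 * Decide a node's next role from its current role and two observations:
 *   heartbeat_seen: a heartbeat from the primary arrived in time
 *   nrp_reachable:  the node can currently reach the NRP
 * Both inputs are assumptions made for this sketch.
 */
node_role_t next_role(node_role_t current, bool heartbeat_seen, bool nrp_reachable) {
    if (current == ROLE_PRIMARY) {
        /* A primary that loses the NRP is on the wrong side of a partition
         * and relinquishes its role. */
        return nrp_reachable ? ROLE_PRIMARY : ROLE_BACKUP;
    }
    /* A backup takes over only if the heartbeats stopped AND it can still
     * reach the NRP. */
    if (!heartbeat_seen && nrp_reachable) {
        return ROLE_PRIMARY;
    }
    return ROLE_BACKUP;
}
```

For example, `next_role(ROLE_BACKUP, false, false)` stays `ROLE_BACKUP`: a backup that misses heartbeats but cannot reach the NRP assumes it is the partitioned one and does not promote itself.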

## Prerequisite

@@ -15,8 +18,12 @@ To run these programs, you are required to first [install the RTI](https://www.l

<table>
<tr>
<td> <img src="img/HeartbeatBully.png" alt="Heartbeat Bully" width="100%"> </td>
<td> <a href="HeartbeatBully.lf"> HeartbeatBully.lf </a>: Basic leader election protocol called "heartbeat bully".</td>
</tr>
<tr>
<td> <img src="img/NRP_FD.png" alt="NRP_FD" width="100%"> </td>
<td> <a href="NRP_FD.lf"> NRP_FD.lf </a>: This version has switch1 failing at 3s, node1 failing at 10s, and node2 failing at 15s.</td>
<td> <a href="NRP_FD.lf"> NRP_FD.lf </a>: Extension using a network reference point (NRP) to help prevent multiple primaries. This version has switch1 failing at 3s, node1 failing at 10s, and node2 failing at 15s.</td>
</tr>
<tr>
<td> <img src="img/NRP_FD_PrimaryFails.png" alt="NRP_FD_PrimaryFails" width="100%"> </td>