Failure Detector

Group Membership Failure Detector

  • When discussing the deployment of a Jiffy-generated application, each instance of the application is referred to as a process, application instance, or node.

  • Processes may fail.

  • Processes exist with one of five publicly disseminated and well-known statuses: {ACTIVE; SUSPECT; FAILED; DEPARTING; DEPARTED}

  • Processes exist with a publicly (within the group) disseminated incarnation number.

  • Process viability is checked at selectable intervals (ping-cycle time) via a configurable ping-ack mechanism.

  • Each process pings all other processes in the group in random order once per ping-cycle. Because the order is re-shuffled each cycle, each process is guaranteed to ping every other process in the group within (2n-1) pings, where n is the number of non-failed processes in the process-list.

  • Processes failing a ping-ack exchange are not immediately failed but are moved into the SUSPECT state.

  • The failure-threshold, i.e. the number of permissible ping-ack failures before a SUSPECT process is marked FAILED, is configurable via the ‘failure_threshold’ key in the application configuration file. At the moment the configuration value is used directly, but in large groups, or in groups where the number of active members varies greatly, it would make sense to combine the configuration value with the number of active processes in the group to derive an appropriate SUSPECT-to-FAILED threshold value.

  • The local status-count of a SUSPECT process is incremented once for each failed ping, or for each notification of a failed ping received from another process in the group.

  • Process status is disseminated across all processes by way of a gossip-based infection-style piggybacking of process information on top of a process’s outgoing Ping messages. For example:

    Pinging process (Pi) sends a Ping containing its local group-membership list to a target process (Pj). Process Pj receives the Ping, takes note of the included group-membership list from Pi, updates its own group-membership list based on the received process status information, then sends an Ack message back to Pi. Ping status scenarios are broken out in a subsequent section.

  • Once a SUSPECT process (Pj) reaches the failure-threshold in any pinging process (Pi), process Pi moves process Pj from the SUSPECT state to the FAILED state in its local group-membership list. A FAILED status is immutable, and the failed process will be removed from the group via the ping-based status dissemination.

  • If a process (Pi) receives a Ping message from another process (Pj) indicating that Pi is in a SUSPECT state, process Pi increments its incarnation number and continues to participate in the ping-cycle. Processes receiving pings from process Pi take note of Pi’s new incarnation number and set their internal record of Pi back to an ACTIVE status. However, if any process in the group has already set Pi’s status to FAILED, this overrides Pi’s incarnation-number-based signal of liveness: all members will set their Pi status to FAILED, after which Pi will be removed from the group.

  • It may be desirable to enhance the application slightly in this area so that process failures trigger some sort of notification.
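The bullets above can be condensed into a per-member record. A minimal sketch in Python (the actual codebase's language, types, and field names are assumptions for illustration):

```python
from dataclasses import dataclass
from enum import Enum

# The five publicly disseminated statuses listed above.
class Status(Enum):
    ACTIVE = "ACTIVE"
    SUSPECT = "SUSPECT"
    FAILED = "FAILED"
    DEPARTING = "DEPARTING"
    DEPARTED = "DEPARTED"

# Hypothetical per-process record held in each node's local
# group-membership list.
@dataclass
class Member:
    addr: str
    status: Status = Status.ACTIVE
    incarnation: int = 1   # publicly disseminated incarnation number
    status_count: int = 0  # local count of failed ping-ack exchanges

membership = {"P2": Member("10.0.0.2:7946")}
print(membership["P2"].status.value)  # ACTIVE
```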

Typically the implementer should not have to worry about the failure detector beyond establishing a reasonable limit for SUSPECT-status ping failures and a reasonable ping-cycle time in the configuration file.

Process Status Scenarios

Consider a group-membership of {P1; P2; P3} where failure-threshold == 3:

P1 Ping-Cycle

  • (P1-Ping) -> (P2)
  • (P2-Ack) -> (P1)
  • (P1-Ping) -> (P3)
  • (P3-Ack) -> (P1)
  • (P1-Ping) -> (P1)
  • (P1-Ack) -> (P1)

P2 Ping-Cycle

  • (P2-Ping) -> (P1)
  • (P1-Ack) -> (P2)
  • (P2-Ping) -> (P3)
  • (P3-Ack) -> (P2)
  • (P2-Ping) -> (P2)
  • (P2-Ack) -> (P2)

P3 Ping-Cycle

  • (P3-Ping) -> (P1)
  • (P1-Ack) -> (P3)
  • (P3-Ping) -> (P2)
  • (P2-Ack) -> (P3)
  • (P3-Ping) -> (P3)
  • (P3-Ack) -> (P3)
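The three cycles above run concurrently, each node working through its own shuffled target list. A sketch of how one node might pick its per-cycle ping order (the listings above also show a trailing self-ping, treated here as a local check and omitted; names are illustrative):

```python
import random

def ping_order(self_id, membership):
    """Return this cycle's ping order: every other non-failed member, shuffled."""
    targets = [pid for pid, status in membership.items()
               if pid != self_id and status != "FAILED"]
    random.shuffle(targets)  # re-shuffled once per ping-cycle
    return targets

members = {"P1": "ACTIVE", "P2": "ACTIVE", "P3": "ACTIVE"}
print(sorted(ping_order("P1", members)))  # ['P2', 'P3']
```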

Example 1

(P1-Ping-1) -> (P2)

(P2-NoAck)

P1 moves P2’s status to SUSPECT in the local P1 group-membership list and sets the P2 status-count to 1.

… ping-cycle completes


(P1-Ping-2) -> (P2)

(P2-NoAck)

P1 leaves P2’s status as SUSPECT in the local P1 group-membership list and increments the P2 status-count by 1 (==2).

… ping-cycle completes


(P1-Ping-3) -> (P2)

(P2-NoAck)

P1 increments the P2 status-count by 1 (==3), which meets the failure-threshold, and moves P2’s status to FAILED in the local P1 group-membership list.

… ping-cycle completes

As P1’s group-membership list is being updated with the P2 ping failures, P1 is disseminating that information to the other processes in the group. It is likely that other processes in the group have also noticed that P2 is not responding, so these processes are also disseminating the P2 SUSPECT status to all other processes in the group - including P1. As a result, failing processes will accrue SUSPECT status counts rapidly in each process’s group-membership list; the failure-threshold value should be computed as a function of the number of running processes in the group (TODO).
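Example 1's bookkeeping can be sketched as follows; the dict-based record and function name are illustrative, with the threshold taken directly from configuration as described above:

```python
FAILURE_THRESHOLD = 3  # from the 'failure_threshold' configuration key

def on_no_ack(member):
    """Record one failed ping-ack exchange against `member`."""
    if member["status"] == "FAILED":
        return  # FAILED is immutable
    if member["status"] != "SUSPECT":
        member["status"] = "SUSPECT"  # first failure: move to SUSPECT
        member["status_count"] = 1
    else:
        member["status_count"] += 1   # subsequent failures
    if member["status_count"] >= FAILURE_THRESHOLD:
        member["status"] = "FAILED"   # threshold reached

p2 = {"status": "ACTIVE", "status_count": 0}
for _ in range(3):  # three consecutive NoAck cycles, as in Example 1
    on_no_ack(p2)
print(p2["status"], p2["status_count"])  # FAILED 3
```

Gossiped notifications of failed pings from other group-members would feed the same counter, which is why counts accrue faster than the local ping-cycle alone would suggest.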

Example 2

(P1-Ping-1) -> (P2)

(P2-NoAck)

P1 moves P2’s status to SUSPECT in the local P1 group-membership list and sets the P2 status-count to 1.

… ping-cycle completes


(P1-Ping-2) -> (P2)

(P2-NoAck)

P1 leaves P2’s status as SUSPECT in the local P1 group-membership list and increments the P2 status-count by 1 (==2).

… ping-cycle completes


(P1-Ping-3) -> (P2)

(P2-Ack) -> (P1)

P1 may set P2’s status back to ACTIVE in the local P1 group-membership list and reset the P2 status-count back to 1.

… ping-cycle completes


Why would P1 not guarantee the reset of P2’s status to the ACTIVE state after receiving an Ack for the last Ping message?

As the ping-cycles run, the possibility exists that if P2 were to come back online it may receive a Ping message containing a group-membership list where it (P2) is in a SUSPECT state. If P2 sees that it is SUSPECT, it reacts by incrementing its incarnation-number, which is then relayed to the group-members via P2’s Ping messages. When a process (let’s say P1) sees that SUSPECT process P2 has sent a Ping, it checks the embedded incarnation-number to see if it is greater than the P2 incarnation-number presently stored in P1’s local group-membership list. If the incarnation-number contained in the Ping is greater, P1 updates its group-membership list to indicate that P2 is ACTIVE with a new incarnation number. If the incarnation-number in the (P2-Ping) is lower than or the same as that of the P2 record in the local P1 group-membership list, the Ping is discarded.

P2 can still send Acks in response to Pings coming from other processes, and due to the concurrent nature of the service it may well send a Ping to the other group-members (P1, for example) in time to prevent itself from being failed, but this is an unavoidable edge-case. When a SUSPECT process recovers to the extent that it is reachable again, and one or more group-members are very close to reaching the failure-threshold for the process in question, there is a chance that the recovered process will be failed anyway.
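The incarnation-number comparison described above can be sketched as a merge rule applied to each piggybacked membership entry (the function name and dict shape are assumptions; the status-count baseline of 1 follows the examples above):

```python
def merge_remote_entry(local, remote_status, remote_incarnation):
    """Apply one piggybacked membership entry to the local record."""
    if local["status"] == "FAILED":
        return  # FAILED is immutable locally
    if remote_status == "FAILED":
        local["status"] = "FAILED"  # FAILED overrides any liveness signal
        return
    if remote_incarnation > local["incarnation"]:
        # A higher incarnation number refutes SUSPECT: back to ACTIVE.
        local["status"] = "ACTIVE"
        local["incarnation"] = remote_incarnation
        local["status_count"] = 1
    # Entries with a lower-or-equal incarnation number are discarded.

p2 = {"status": "SUSPECT", "incarnation": 1, "status_count": 2}
merge_remote_entry(p2, "ACTIVE", 2)
print(p2["status"], p2["incarnation"])  # ACTIVE 2
```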

There is a lot more to write here in order to completely describe the mechanics of the failure detector. Additionally, the failure detector operates using a bully-type algorithm that has been abridged slightly to favor the highest known process as the new leader in a current-leader failure scenario. The highest known process is determined locally through inspection of each application instance’s group-membership map. This is in contrast to the typical bully implementation, where ELECT-OK messages are used alongside local introspection. The codebase contains a full bully algorithm implementation, but the ELECT-OK message pair has been commented out for now. There is a tentative plan to do one or both of the following:

  1. Reimplement the failure detector using the Raft algorithm.
  2. Retool the Jiffy application to create applications tightly integrated into the Kubernetes / CoreOS ecosystems.
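The abridged bully selection mentioned above might look like the following: choose the highest non-failed process id from the local group-membership map, with no ELECT-OK exchange (function and field names are illustrative assumptions):

```python
def elect_leader(membership):
    """Abridged bully: highest known live process wins, decided locally."""
    live = [pid for pid, status in membership.items() if status != "FAILED"]
    return max(live) if live else None

members = {"P1": "ACTIVE", "P2": "FAILED", "P3": "ACTIVE"}
print(elect_leader(members))  # P3
```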