sidebar_position | title |
---|---|
3 |
Jail Throttling |
- 2023-01-26: Initial Draft
- 2023-02-07: Property refined, ADR ready to review/merge
- 2023-11-22: Refactor for better understanding
Accepted
The CCV spec is based around the assumption that the provider binary and all consumers binaries are non-malicious, and follow the defined protocols. In practice, this assumption may not hold. A malicious consumer binary could potentially include code which is able to send many slash/jail packets at once to the provider.
Before the throttling feature was implemented, the following attack was possible. Attacker(s) would create provider validators just below the provider's active set. Using a malicious consumer binary, slash packets would be relayed to the provider, that would slash/jail a significant portion (or all) of honest validator at once. Control of the provider would then pass over to the attackers' validators. This enables the attacker(s) to halt the provider. Or even worse, commit arbitrary state on the provider, potentially stealing all tokens bridged to the provider over IBC.
The throttling feature was designed to slow down the mentioned attack from above, allowing validators and the community to appropriately respond to the attack, i.e., this feature limits (enforced by on-chain params) the rate that the provider validator set can be jailed over time.
Slash meter: There exists one slash meter on the provider which stores an amount of voting power (integer), corresponding to an allowance of validators that can be jailed over time. This meter is initialized to a certain value on genesis, decremented by the amount of voting power jailed whenever a slash packet is handled, and periodically replenished as decided by on-chain params.
Global entry queue: There exists a single queue which stores "global slash entries". These entries allow the provider to appropriately handle slash packets sent from any consumer in FIFO ordering. This queue is responsible for coordinating the order that slash packets (from multiple chains) are handled over time.
Per-chain data queue: For each established consumer, there exists a queue which stores "throttled packet data", i.e.,pending slash packet data is queued together with pending VSC matured packet data in FIFO ordering. Order is enforced by IBC sequence number. These "per-chain" queues are responsible for coordinating the order that slash packets are handled in relation to VSC matured packets from the same chain.
Note: The reason for a multiple-queue design is the VSC Maturity and Slashing Order property (see spec). There are other ways to ensure such a property (like a queue of linked lists, etc.), but the proposed approach seemed to be the most understandable and easiest to implement with a KV store.
SlashMeterReplenishPeriod
-- the period after which the slash meter is replenished.
SlashMeterReplenishFraction
-- the portion (in range [0, 1]) of total voting power that is replenished to the slash meter when a replenishment occurs. This param also serves as a maximum fraction of total voting power that the slash meter can hold.
MaxThrottledPackets
-- the maximum amount of throttled slash or vsc matured packets that can be queued from a single consumer before the provider chain halts, it should be set to a large value.
This param would allow provider binaries to panic deterministically in the event that packet throttling results in a large amount of state-bloat. In such a scenario, packet throttling could prevent a violation of safety caused by a malicious consumer, at the cost of provider liveness.
Upon the provider receiving a slash packet from any of the established consumers during block execution, two things occur:
- A global slash entry is queued.
- The data of such a packet is added to the per-chain queue.
Upon the provider receiving a VSCMatured packet from any of the established consumers during block execution, the VSCMatured packet data is added to the per-chain queue.
In the EndBlock
of the provider CCV module, there are three actions performed:
- replenish the slash meter;
- handle the leading
VSCMaturedPackets
; - and handle the throttle queues.
Once the slash meter becomes not full, it'll be replenished after SlashMeterReplenishPeriod
by incrementing the meter with its allowance for the replenishment block, where allowance
= SlashMeterReplenishFraction
* currentTotalVotingPower
.
The slash meter will never exceed its current allowance (function of the total voting power for the block) in value.
Note a few things:
- The slash meter can go negative in value, and will do so when handling a single slash packet that jails a validator with significant voting power. In such a scenario, the slash meter may take multiple replenishment periods to once again reach a positive value (or 0), meaning no other slash packets may be handled for multiple replenishment periods.
- Total voting power of a chain changes over time, especially as validators are jailed. As validators are jailed, total voting power decreases, and so does the jailing allowance. See below for more detailed throttling property discussion.
- The voting power allowance added to the slash meter during replenishment will always be greater than or equal to 1.
If the
SlashMeterReplenishFraction
is set too low, integer rounding will put this minimum value into effect. That is, ifSlashMeterReplenishFraction
*currentTotalVotingPower
< 1, then the effective allowance would be 1. This min value of allowance ensures that there's some packets handled over time, even if that is a very long time. It's a crude solution to an edge case caused by too small of a replenishment fraction.
The behavior described above is achieved by executing CheckForSlashMeterReplenishment()
every EndBlock
, BEFORE HandleThrottleQueues()
is executed.
In every block, it is possible that VSCMaturedPacket
data was queued before any slash packet data.
Since this "leading" VSCMatured packet data does not have to be throttled (see VSC Maturity and Slashing Order), we can handle all VSCMatured packet data at the head of the queue, before the any throttling or packet data handling logic executes.
In every EndBlock
, the following logic is executed to handle data from the throttle queues.
meter := getSlashMeter()
// Keep iterating as long as the meter has a positive (or 0) value, and global slash entries exist
while meter.IsPositiveOrZero() && entriesExist() {
// Get next entry in queue
entry := getNextGlobalSlashEntry()
// Decrement slash meter by the voting power that will be removed from the valset from handling this slash packet
valPower := entry.getValPower()
meter = meter - valPower
// Using the per-chain queue, handle the single slash packet using its queued data,
// then handle all trailing VSCMatured packets for this consumer
handleSlashPacketAndTrailingVSCMaturedPackets(entry)
// Delete entry in global queue, delete handled data
entry.Delete()
deleteThrottledSlashPacketData()
deleteTrailingVSCMaturedPacketData()
}
All CCV system properties should be maintained by implementing this feature, see CCV spec - Consumer Initiated Slashing.
One implementation-specific property introduced is that if any of the chain-specific packet data queues become larger than MaxThrottledPackets
, then the provider binary will panic, and the provider chain will halt.
Therefore this param should be set carefully. See SetThrottledPacketDataSize
.
This behavior ensures that if the provider binaries are queuing up more packet data than machines can handle, the provider chain halts deterministically between validators.
Using on-chain params and the sub protocol defined, slash packet throttling is implemented such that the following property holds under some conditions.
First, we introduce the following definitions:
- A consumer initiated slash attack "starts" when the first slash packet from such an attack is received by the provider.
- The "initial validator set" for the attack is the validator set that existed on the provider when the attack started.
- There is a list of honest validators such that if they are jailed,
X
% of the initial validator set will be jailed.
For the Throttling Property to hold, the following assumptions must be true:
- We assume the total voting power of the chain (as a function of delegations) does not increase over the course of the attack.
- No validator has more than
SlashMeterReplenishFraction
of total voting power on the provider. SlashMeterReplenishFraction
is large enough thatSlashMeterReplenishFraction
*currentTotalVotingPower
> 1, i.e., the replenish fraction is set high enough that we can ignore the effects of rounding.SlashMeterReplenishPeriod
is sufficiently longer than the time it takes to produce a block.
Note if these assumptions do not hold, throttling will still slow down the described attack in most cases, just not in a way that can be succinctly described. It's possible that more complex properties can be defined.
Throttling Property: The time it takes to jail/tombstone X
% of the initial validator set will be greater than or equal to
Intuition
Let's use the following notation:
$C$ : Number of replenishment cycles$P$ :$\mathit{SlashMeterReplenishPeriod}$ $F$ :$\mathit{SlashMeterReplenishFraction}$ $V_{\mathit{max}}$ : Max power of a validator as a fraction of total voting powerIn
$C$ number of replenishment cycles, the fraction of total voting power that can be removed,$a$ , is$a \leq F \cdot C + V_{\mathit{max}}$ (where$V_{\mathit{max}}$ is there to account for the power fraction of the last validator removed, one which pushes the meter to the negative value).So, we need at least
$C \geq \frac{a - V_{\mathit{max}}}{F}$ cycles to remove$a$ fraction of the total voting power.Since we defined the start of the attack to be the moment when the first slash request arrives, then
$F$ fraction of the initial validator set can be jailed immediately. For the remaining$X - F$ fraction of the initial validator set to be jailed, it takes at least$C \geq \frac{(X - F) - V_{\mathit{max}}}{F}$ cycles. Using the assumption that$V_{\mathit{max}} \leq F$ (assumption 2), we get$C \geq \frac{X - 2F}{F}$ cycles.In order to execute
$C$ cycles, we need$C \cdot P$ time.Thus, jailing the remaining
$X - F$ fraction of the initial validator set corresponds to$\frac{P \cdot (X - 2F)}{F}$ time.In other words, the attack must take at least
$\frac{P \cdot X}{F} - 2P$ time (in the units of replenish period$P$ ).
This property is useful because it allows us to reason about the time it takes to jail a certain percentage of the initial provider validator set from consumer initiated slash requests.
For example, if SlashMeterReplenishFraction
is set to 0.06
, then it takes no less than 4 replenishment periods to jail 33% of the initial provider validator set on the Cosmos Hub.
Note that as of writing this on 11/29/22, the Cosmos Hub does not have a validator with more than 6% of total voting power.
Note also that 4 replenishment period is a worst case scenario that depends on well crafted attack timings.
Note that the jailing allowance is directly proportional to the current total voting power of the provider chain. Therefore, if honest validators don't unjail themselves during the attack, the total voting power of the provider chain will decrease over the course of the attack, and the attack will be slowed down, main throttling property is maintained.
If honest validators do unjail themselves, the total voting power of the provider chain will still not become higher than when the attack started (unless new token delegations happen), therefore the main property is still maintained. Moreover, honest validators unjailing themselves helps prevent the attacking validators from gaining control of the provider.
In summary, the throttling mechanism as designed has desirable properties whether or not honest validators unjail themselves over the course of the attack.
- The described attack is slowed down in seemingly all cases.
- If certain assumptions hold, the described attack is slowed down in a way that can be precisely time-bounded.
- Throttling introduces a vector for a malicious consumer chain to halt the provider, see issue below. However, this is sacrificing liveness in a edge case scenario for the sake of security. As an improvement, using retries would fully prevent this attack vector.
- Additional state is introduced to the provider chain.
- VSCMatured and slash packet data is not always handled in the same block that it is received.