Currently replicants have to copy the entire contents of the tables when they reconnect, even after a short downtime. With a large enough volume of data, this may hinder cluster recovery after a disaster or maintenance. Moreover, it makes it almost impossible to rebalance the load on the core nodes.
Initially we tried to solve this problem by persisting the transaction log, so the replicants recovering after a reconnect could replay it instead of going through the entire bootstrap procedure.
That approach proved to hurt performance too much to be practical. In addition, flapping client connections often generate delete -> add -> delete -> ... loops in the transaction log, which can make the log larger than the table itself and calls the whole idea of replaying the transaction log into question.
Below I describe an alternative approach: instead of trying to avoid bootstrap, we could speed it up.
For each shard we create a so-called "delta set" that consists of N set-like tables (plain ets or plain rocksdb) storing records of the form {{Table, Key}, X}, where X is the value of a counter.
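To make the data layout concrete, here is a minimal sketch of how such a delta set could be represented, assuming plain ets tables. The module, record, and function names (delta_set, new/1, and the helpers sketched further below) are hypothetical illustrations, not part of the Mria code base.

```erlang
-module(delta_set).
-export([new/1, rotate/1, record_keys/2, handshake/2, keys_since/2]).

%% The delta set keeps the N ring-buffer tables (newest first) together
%% with the current value of the X counter (the logical timestamp).
-record(delta_set, {tabs :: [ets:tid()],
                    x = 0 :: non_neg_integer()}).

%% Create a delta set of N empty set-like tables; records written into
%% them have the shape {{Table, Key}, X}.
new(N) when N > 1 ->
    Tabs = [ets:new(delta, [set, public]) || _ <- lists:seq(1, N)],
    #delta_set{tabs = Tabs}.
```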
Every M seconds we rotate the tables in the delta set in a ring-buffer fashion: the contents of the oldest table are dropped and the counter X is incremented by 1.
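Rotation could then look roughly like this, continuing the hypothetical delta_set sketch: the oldest table is wiped, reused as the new current table, and the counter is bumped.

```erlang
%% Rotate the ring buffer: drop the contents of the oldest table, move it
%% to the front as the new "current" table, and increment the counter.
rotate(#delta_set{tabs = Tabs, x = X} = DS) ->
    Oldest = lists:last(Tabs),
    true = ets:delete_all_objects(Oldest),
    DS#delta_set{tabs = [Oldest | lists:droplast(Tabs)], x = X + 1}.
```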
The mria_rlog_server process, as it processes intercepted transactions, writes each affected key to the current table of the delta set. Existing keys are simply overwritten.
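For illustration, recording the keys touched by an intercepted transaction might look like the following sketch; record_keys/2 is a made-up helper, not how mria_rlog_server is actually structured.

```erlang
%% Write every {Table, Key} affected by an intercepted transaction into the
%% current (newest) table of the delta set, tagged with the current counter
%% value. Since the table is of type set, existing keys are overwritten.
record_keys(#delta_set{tabs = [Current | _], x = X}, TableKeyPairs) ->
    true = ets:insert(Current, [{{Table, Key}, X} || {Table, Key} <- TableKeyPairs]),
    ok.
```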
When a replicant connects to the core node, it reports its current logical timestamp in the hello message.
The core checks if the timestamp is covered by its delta set.
If it is covered, the bootstrap server sends only the keys recorded in the delta set and skips the clear_table command (https://github.com/emqx/mria/blob/main/src/mria_bootstrapper.erl#L240), so the replicant preserves its local data.
If the replicant's data is too old, the bootstrap server falls back to the normal bootstrap loop.
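To make the handshake concrete, here is a rough sketch of the decision the bootstrap server could take, assuming a delta set of N tables covers roughly the last N values of the counter; handshake/2 and keys_since/2 are made-up helpers continuing the delta_set sketch above.

```erlang
%% If the replicant's timestamp is covered by the delta set, send only the
%% affected keys and skip clear_table; otherwise fall back to full bootstrap.
handshake(#delta_set{tabs = Tabs, x = X} = DS, ReplicantX)
  when ReplicantX =< X, X - ReplicantX < length(Tabs) ->
    {replay_delta, keys_since(DS, ReplicantX)};
handshake(_DS, _ReplicantX) ->
    full_bootstrap.

%% Collect every {Table, Key} recorded at or after the replicant's timestamp.
keys_since(#delta_set{tabs = Tabs}, ReplicantX) ->
    lists:usort([TK || Tab <- Tabs,
                       {TK, KeyX} <- ets:tab2list(Tab),
                       KeyX >= ReplicantX]).
```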
Pitfalls:
Clock skews. Possible solution: make sure to also replay keys going back noticeably further than the replicant's timestamp.
Care should be taken to avoid "skipping" over a part of the delta set as the tables get rotated while the bootstrap is running. This could perhaps be prevented by making sure the X counter doesn't change during bootstrap and only increments by 1 when jumping to the next table.
When the core node itself restarts, or the rlog_server process restarts, it can skip certain keys. In this situation replaying the delta set would lead to inconsistent results, so the best course of action would be to simply drop it. This means any replicant node that connects to a freshly restarted core node has to bootstrap from scratch.
I assume one needed step is to periodically update the replicants. Maybe, each time the delta set tables rotate, the cores would broadcast the new logical timestamp to the replicants?
Also, I guess clock skews would come into play here in the sense that, although the timestamps would be just counters, each core node could diverge on which timestamp is current. That is, each core might be at a different "time". So replicants might need to keep track of current timestamps per core rather than just per table.