CDC replication stopped in chaos-mesh (ch-benchmark-pg-cdc) test #15141
Comments
Some background first: our CDC connector consumes PG CDC events while acking the consumed offset (LSN) back to the PG server at regular intervals. Upstream PG then assumes that the WAL up to those offsets can be discarded (see lines 189 to 199 in e6d8d88).
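For readers who haven't looked at the embedded engine before, here is a minimal, self-contained sketch of a Debezium embedded-engine consumer. It is not RisingWave's connector code, and the connection properties (host, database, credentials) are placeholders. The point to notice is that `markProcessed` only marks an offset as consumable; the engine acks (flushes) marked offsets back to the source on the `offset.flush.interval.ms` timer, independently of any downstream checkpointing.

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PgCdcConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "pg-cdc-sketch");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.MemoryOffsetBackingStore");
        // Marked offsets are only flushed (acked) upstream on this timer.
        props.setProperty("offset.flush.interval.ms", "60000");
        // Placeholder connection settings for the sketch.
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "postgres");
        props.setProperty("database.password", "postgres");
        props.setProperty("database.dbname", "ch_benchmark");
        props.setProperty("topic.prefix", "pg");

        DebeziumEngine<ChangeEvent<String, String>> engine =
                DebeziumEngine.create(Json.class)
                        .using(props)
                        .notifying((events, committer) -> {
                            for (ChangeEvent<String, String> event : events) {
                                // Hand the event to the downstream executor here.
                                // Marking it processed makes its LSN eligible for
                                // the next periodic offset flush (the "ack" to PG).
                                committer.markProcessed(event);
                            }
                            committer.markBatchFinished();
                        })
                        .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);
    }
}
```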
DebeziumEngine will commit those marked offsets to the upstream.

Findings: after some investigation, I think the reason for the "Cannot seek to the last known offset" error is that we ack the offset to PG before the checkpoint commit. So when the cluster recovers from a committed checkpoint, the restored offset may already have been discarded by upstream PG. Currently our framework doesn't have a checkpoint-commit callback mechanism to notify the source executor; an intuitive idea is to let Meta broadcast RPCs to each CN in the cluster. cc @hzxa21

To confirm the findings, I increased the offset flush interval to 30 minutes, which is much larger than the time the test needs, and reran the chaos test (stresschaos only and w/o memtable spill: 599, 603). The results show that the "Cannot seek" error is gone, and the mv check passes as well. But when I ran the chaos test with 3 CNs (601), even though the error is gone, the source table is still unsynced with PG; I have no explanation for that one right now.
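As an illustration of the checkpoint-commit callback idea (the class and method names below are invented; this is not RisingWave's actual mechanism), one possible shape is to buffer records per epoch and only call `markProcessed` once the checkpoint covering that epoch has been committed, so PG is never told to discard WAL that a restore might still need:

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;

import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Hypothetical helper: hold back Debezium offset acks until the checkpoint
 * that covers those records has been committed. All names here are invented
 * for illustration only.
 */
public class CheckpointGatedCommitter {
    private record Pending(long epoch, ChangeEvent<String, String> event) {}

    private final DebeziumEngine.RecordCommitter<ChangeEvent<String, String>> committer;
    private final Queue<Pending> pending = new ArrayDeque<>();

    public CheckpointGatedCommitter(
            DebeziumEngine.RecordCommitter<ChangeEvent<String, String>> committer) {
        this.committer = committer;
    }

    /** Called as events are consumed; nothing is acked yet. */
    public synchronized void buffer(long epoch, ChangeEvent<String, String> event) {
        pending.add(new Pending(epoch, event));
    }

    /** Called when Meta notifies this CN that `epoch` has been committed. */
    public synchronized void onCheckpointCommitted(long epoch) throws InterruptedException {
        boolean acked = false;
        while (!pending.isEmpty() && pending.peek().epoch() <= epoch) {
            committer.markProcessed(pending.poll().event());
            acked = true;
        }
        if (acked) {
            // Only now do the marked offsets become eligible for the periodic
            // flush that acks the LSN back to PG.
            committer.markBatchFinished();
        }
    }
}
```

The experiment above effectively approximates this by stretching `offset.flush.interval.ms` beyond the test duration, so no ack reaches PG before the run ends.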
+1 for this. But that will certainly take some time. Before that, shall we use some hacky workaround to unblock the CH-Benchmark chaos test? For example, set PG's …
Will confirm whether it can work.
Is it possible that when the actor/executor is dropped during recovery, the offset is force-flushed regardless of the offset flush interval?
https://github.com/risingwavelabs/kube-bench/pull/408
I reproduced the problem in job 646. The conclusion should be that the dedicated source does not tolerate recovery (by design) while the initial snapshot is being loaded. The cluster crashed during snapshot loading, so after recovery the CDC source initiates a new snapshot and consumes from a new LSN offset, which means the stale rows already in RW never get deleted. We can try the PG shared source, which has recoverable backfill, to confirm the above findings. https://risingwave-labs.slack.com/archives/C064SBT0ASF/p1709632464607879 Since the dedicated CDC source doesn't support recoverable initial snapshot loading, we should ensure the historical data is empty in the chaos-mesh test for it. cc @lmatz @cyliu0
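To spell out why a restarted snapshot leaves the table unsynced, here is a purely illustrative timeline with toy data (no RisingWave or Debezium code): rows copied by the aborted first snapshot stay in the downstream table, while a delete that happens before the second snapshot's starting LSN is never replayed.

```java
import java.util.HashSet;
import java.util.Set;

public class SnapshotRestartExample {
    public static void main(String[] args) {
        // Upstream PG table starts with rows {1, 2, 3}; downstream (RW) starts empty.
        Set<Integer> upstream = new HashSet<>(Set.of(1, 2, 3));
        Set<Integer> downstream = new HashSet<>();

        // 1st snapshot attempt at LSN X: copies rows 1 and 2, then the CN crashes
        // before row 3 is copied and before streaming starts.
        downstream.add(1);
        downstream.add(2);

        // While the cluster recovers, the upstream deletes row 2.
        upstream.remove(2);

        // 2nd snapshot attempt at a newer LSN Y: copies the *current* upstream
        // state {1, 3}, then streams changes after Y. The DELETE of row 2
        // happened before Y, so it is never replayed downstream.
        downstream.addAll(upstream);

        // Downstream keeps the stale row 2 copied by the first, aborted snapshot.
        System.out.println("upstream   = " + upstream);   // [1, 3]
        System.out.println("downstream = " + downstream); // [1, 2, 3]  <- unsynced
    }
}
```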
Let me see whether I understand it correctly. What happened was:
1. The cluster crashed while the dedicated CDC source was still loading the initial snapshot.
2. After recovery, the source started a new snapshot and began consuming from a new LSN offset.
3. Rows deleted upstream in the meantime are never deleted in RW, so the table stays unsynced with PG.
Will the …?
No, it won't block. For the chaos test, I recommend ensuring the historical data is empty when creating RW CDC tables on dedicated CDC sources.
Describe the bug
After the CN is killed and restarted, the CDC tables seem to lose sync with their upstream PG tables.
BuildKite
Grafana
Logs
namespace: longcmkf-20240220-022651
Error message/log
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
No response
Additional context
No response