CDC replication stopped in chaos-mesh (ch-benchmark-pg-cdc) test #15141

fuyufjh · 2024-02-20T03:05:54Z

Describe the bug

After the CN being killed and restarted, the CDC tables seems to lose sync with their upstream PG.

$ psql -h localhost -p 4567 -d dev -U root

dev=> select * from new_order;
 no_o_id | no_d_id | no_w_id 
---------+---------+---------
    2288 |       1 |       1
    2299 |       1 |       1
    2357 |       1 |       1
    2366 |       1 |       1
    2389 |       1 |       1
.... too many rows ...
    2883 |      10 |       1
    2888 |      10 |       1
    2947 |      10 |       1
    2952 |      10 |       1
(8024 rows)

$ psql -h localhost -p 5432 -U postgres 

postgres=# select * from new_order;
 no_o_id | no_d_id | no_w_id 
---------+---------+---------
   13357 |      10 |       1
   13353 |       7 |       1
   13357 |       7 |       1
   13300 |       2 |       1
   12957 |       3 |       1
   13376 |       5 |       1
   13355 |       7 |       1
   13358 |       7 |       1
   13359 |       7 |       1
   13326 |       9 |       1
   13328 |       9 |       1
   13354 |       7 |       1
   13356 |       7 |       1
   13327 |       9 |       1
(14 rows)

BuildKite
Grafana
Logs

namespace: longcmkf-20240220-022651

Error message/log

2024-02-20T02:36:14.667064577Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/C90F020' on replication slot 'customer_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/C90F020, minimum is 0/15B79B60': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/C90F020, minimum is 0/15B79B60
2024-02-20T02:36:14.667078575Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/C9F2078' on replication slot 'new_order_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/C9F2078, minimum is 0/14EB0598': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/C9F2078, minimum is 0/14EB0598
2024-02-20T02:36:14.669016214Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/107346E0' on replication slot 'stock_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/107346E0, minimum is 0/15E74680': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/107346E0, minimum is 0/15E74680
2024-02-20T02:36:14.674653919Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/4EECA108' on replication slot 'history_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/4EECA108, minimum is 0/80CB5060': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/4EECA108, minimum is 0/80CB5060
2024-02-20T02:36:14.674658103Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/9831AC8' on replication slot 'item_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/9831AC8, minimum is 0/81998D50': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/9831AC8, minimum is 0/81998D50
2024-02-20T02:36:14.679035212Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/4EECA108' on replication slot 'warehouse_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/4EECA108, minimum is 0/81086F00': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/4EECA108, minimum is 0/81086F00
2024-02-20T02:36:15.826899745Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/C9F2078' on replication slot 'new_order_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/C9F2078, minimum is 0/14EB0598': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/C9F2078, minimum is 0/14EB0598
2024-02-20T02:36:15.849238513Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/C90F020' on replication slot 'customer_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/C90F020, minimum is 0/15B79B60': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/C90F020, minimum is 0/15B79B60
2024-02-20T02:36:15.852313048Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/107346E0' on replication slot 'stock_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/107346E0, minimum is 0/15E74680': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/107346E0, minimum is 0/15E74680
2024-02-20T02:36:16.752002682Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/9831AC8' on replication slot 'item_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/9831AC8, minimum is 0/81998D50': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/9831AC8, minimum is 0/81998D50
2024-02-20T02:36:16.768027687Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/4EECA108' on replication slot 'history_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/4EECA108, minimum is 0/80CB5060': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/4EECA108, minimum is 0/80CB5060
2024-02-20T02:36:17.341700037Z  WARN risingwave_connector_node: Cannot seek to the last known offset '0/4EECA108' on replication slot 'warehouse_slot'. Error from server: 'ERROR: cannot advance replication slot to 0/4EECA108, minimum is 0/81086F00': org.postgresql.util.PSQLException: ERROR: cannot advance replication slot to 0/4EECA108, minimum is 0/81086F00

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

StrikeW · 2024-02-22T14:46:47Z

Some background go first

Our cdc connector consumes PG cdc events while acking to the PG server at regular intervals the offset (lsn) that has been consumed. Then upstream PG will assume that wal log of those offsets can be discarded.

risingwave/java/connector-node/risingwave-connector-service/src/main/java/com/risingwave/connector/source/core/DbzCdcEventConsumer.java

Lines 189 to 199 in e6d8d88

    
               committer.markProcessed(event); 
        
           } 
        
           // skip empty batch 
        
           if (respBuilder.getEventsCount() > 0) { 
        
               respBuilder.setSourceId(sourceId); 
        
               var response = respBuilder.build(); 
        
               outputChannel.put(response); 
        
           } 
        
           committer.markBatchFinished();

DebeziumEngine will commit those marked offsets to upstream:
https://github.com/debezium/debezium/blob/4ca2a67b0d302c611b89b1931728377cf232ab6c/debezium-connector-postgres/src/main/java/io/debezium/connector/postgresql/PostgresStreamingChangeEventSource.java#L435-L436

Findings

After some investigation, I think the reason for the "Cannot seek to the last known offset" error is that we ack the offset to PG before the checkpoint commit. So that when the cluster recovered from a committed checkpoint, the restored offset may already been discarded by upstream PG.

Currently our framework doesn't have a checkpoint commit callback mechanism to notify the source executor. An intuitive idea is to let Meta make a broadcast RPCs to each CNs in the cluster. cc @hzxa21

To confirm the findings, I increase the offset flush interval to 30mins which is much large than the time required for the test and rerun the chaos test (stresschaos only and w/o memtable spill: 599, 603 ), the results show that the "Cannot seek" error is gone and btw mv check is passed.

But! when I run chaos test with 3CNs (601) , even the error is gone, the source table is still unsynced with PG, I have no idea of this one right now.

fuyufjh · 2024-02-23T08:02:00Z

Currently our framework doesn't have a checkpoint commit callback mechanism to notify the source executor.

+1 for this.

But that will certainly take some time. Before that, shall we use some hacky way to work around the problem to unblock the CH-Benchmark chaos test?

For example, set PG's wal_keep_size to higher number to let PG keeps more WAL. Note that this only works for our testing env. Users will still run into this problem.

StrikeW · 2024-02-26T04:03:10Z

For example, set PG's wal_keep_size to higher number to let PG keeps more WAL. Note that this only works for our testing env. Users will still run into this problem.

Will confirm whether it can work.

hzxa21 · 2024-02-28T04:29:06Z

To confirm the findings, I increase the offset flush interval to 30mins which is much large than the time required for the test and rerun the chaos test (stresschaos only and w/o memtable spill: 599, 603 ), the results show that the "Cannot seek" error is gone and btw mv check is passed.

But! when I run chaos test with 3CNs (601) , even the error is gone, the source table is still unsynced with PG, I have no idea of this one right now.

Is it possible that when the actor/executor is dropped during recovery, the offset is force flushed regardless of what the offset flush interval is?

lmatz · 2024-03-05T09:02:03Z

https://github.com/risingwavelabs/kube-bench/pull/408
wal_keep_size now allows customization

StrikeW · 2024-03-06T09:23:36Z

But! when I run chaos test with 3CNs (601) , even the error is gone, the source table is still unsynced with PG, I have no idea of this one right now.

I reproduced the problem in job 646. And the new_order in RW has more rows than PG.

After confirmed with @cyliu0, the workload has DELETE operations on upstream PG. So the problem is why upstream DELETE events doesn't ingested into RW.

The conclusion should be the dedicated source is not tolerant recovery (by design) during the progress of initial snapshot loading. And the cluster is crashed during snapshot loading, then after recovery the cdc source will initiate a new snapshot again and consume from a new lsn offset so that the rows in RW cannot be deleted.

We can try to use PG share source which has recoverable backfill to confirm the above findings. https://risingwave-labs.slack.com/archives/C064SBT0ASF/p1709632464607879

Since dedicated cdc source doesn't support recoverable initial snapshot loading, we should ensure the historical data is empty in chaos-mesh test for it . cc @lmatz @cyliu0

hzxa21 · 2024-03-07T03:17:58Z

But! when I run chaos test with 3CNs (601) , even the error is gone, the source table is still unsynced with PG, I have no idea of this one right now.

I reproduced the problem in job 646. And the new_order in RW has more rows than PG. After confirmed with @cyliu0, the workload has DELETE operations on upstream PG. So the problem is why upstream DELETE events doesn't ingested into RW.

The conclusion should be the dedicated source is not tolerant recovery (by design) during the progress of initial snapshot loading. And the cluster is crashed during snapshot loading, then after recovery the cdc source will initiate a new snapshot again and consume from a new lsn offset so that the rows in RW cannot be deleted.

We can try to use PG share source which has recoverable backfill to confirm the above findings. https://risingwave-labs.slack.com/archives/C064SBT0ASF/p1709632464607879

Since dedicated cdc source doesn't support recoverable initial snapshot loading, we should ensure the historical data is empty in chaos-mesh test for it . cc @lmatz @cyliu0

Let me see whether I understand it correct. What happened was:

INSERT row1 (wal offset = 0) into PG table.
PG -> RW CDC table created.
RW CDC table perform snapshot read and then consuming wal offset = 0. RW table = [row1]
RW recovery starts.
DELETE row1 (wal offset = 1) from PG.
INSERT row2 (wal offset = 2) into PG,
RW recovery finishes.
RW CDC table perform snapshot read (read row2) and then consuming wal offset = 2.
RW table = [row1, row2]. row1 delete is missing

hzxa21 · 2024-03-07T03:21:49Z

Will the CREATE TABLE statement be blocked until the initial snapshot loading finishes for dedicated source CDC? I think we can simply resolve this issue by dropping/clearing the table if it fails before initial snapshot loading and starts clean after recovery.

StrikeW · 2024-03-07T03:30:14Z

Will the CREATE TABLE statement be blocked until the initial snapshot loading finishes for dedicated source CDC?

No. it won't block. For chaos test purpose, I recommend ensuring historical data is empty when create RW cdc tables for dedicated cdc sources.

StrikeW · 2024-04-01T03:27:22Z

Since dedicated cdc source doesn't support recoverable initial snapshot loading, we should ensure the historical data is empty in chaos-mesh test for it . cc @lmatz @cyliu0

As I mentioned, it is not a bug and this issue can be closed.

fuyufjh added the type/bug Something isn't working label Feb 20, 2024

github-actions bot added this to the release-1.7 milestone Feb 20, 2024

StrikeW self-assigned this Feb 20, 2024

fuyufjh mentioned this issue Feb 20, 2024

PG cdc q3 checksums inconsistent #15057

Closed

fuyufjh added priority/high block-release-v1.7 labels Feb 20, 2024

lmatz added the found-by-chaos-mesh label Feb 22, 2024

lmatz removed the block-release-v1.7 label Feb 22, 2024

lmatz added the block-release-v1.8 label Feb 27, 2024

xuefengze mentioned this issue Mar 5, 2024

ch-benchmark-pg-cdc data verification failed in daily chaos mesh test #15434

Closed

fuyufjh mentioned this issue Mar 5, 2024

feat: checkpoint commit callback (for cdc-connector) #15464

Closed

StrikeW modified the milestones: release-1.7, release-1.8 Mar 6, 2024

StrikeW closed this as completed Apr 1, 2024

StrikeW mentioned this issue Apr 1, 2024

chaos mesh daily test: ch-benchmark-pg-cdc data verification failed #15312

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CDC replication stopped in chaos-mesh (ch-benchmark-pg-cdc) test #15141

CDC replication stopped in chaos-mesh (ch-benchmark-pg-cdc) test #15141

fuyufjh commented Feb 20, 2024 •

edited by StrikeW

Loading

StrikeW commented Feb 22, 2024 •

edited

Loading

fuyufjh commented Feb 23, 2024

StrikeW commented Feb 26, 2024

hzxa21 commented Feb 28, 2024 •

edited

Loading

lmatz commented Mar 5, 2024

StrikeW commented Mar 6, 2024 •

edited

Loading

hzxa21 commented Mar 7, 2024

hzxa21 commented Mar 7, 2024

StrikeW commented Mar 7, 2024 •

edited

Loading

StrikeW commented Apr 1, 2024 •

edited

Loading

CDC replication stopped in chaos-mesh (ch-benchmark-pg-cdc) test #15141

CDC replication stopped in chaos-mesh (ch-benchmark-pg-cdc) test #15141

Comments

fuyufjh commented Feb 20, 2024 • edited by StrikeW Loading

Describe the bug

Error message/log

To Reproduce

Expected behavior

How did you deploy RisingWave?

The version of RisingWave

Additional context

StrikeW commented Feb 22, 2024 • edited Loading

Some background go first

Findings

fuyufjh commented Feb 23, 2024

StrikeW commented Feb 26, 2024

hzxa21 commented Feb 28, 2024 • edited Loading

lmatz commented Mar 5, 2024

StrikeW commented Mar 6, 2024 • edited Loading

hzxa21 commented Mar 7, 2024

hzxa21 commented Mar 7, 2024

StrikeW commented Mar 7, 2024 • edited Loading

StrikeW commented Apr 1, 2024 • edited Loading

fuyufjh commented Feb 20, 2024 •

edited by StrikeW

Loading

StrikeW commented Feb 22, 2024 •

edited

Loading

hzxa21 commented Feb 28, 2024 •

edited

Loading

StrikeW commented Mar 6, 2024 •

edited

Loading

StrikeW commented Mar 7, 2024 •

edited

Loading

StrikeW commented Apr 1, 2024 •

edited

Loading