
feat(sink): do compact per barrier with the sink pk in the sink executor #13411

Closed
st1page opened this issue Nov 14, 2023 · 6 comments

@st1page
Contributor

st1page commented Nov 14, 2023

This issue proposes to always do per-barrier compaction in the sink executor for sinks with a key/primary key, because:

  1. We already added this behavior in feat(sink): handle stream key sink pk mismatch #12458; doing it unconditionally makes the behavior consistent.
  2. Some users find it confusing when RW sinks unnecessary intermediate results, i.e. the sink emits more rows than expected. Follow-up of #10853 and #13025.

Cons

After this change, data freshness at the sink is bounded by the barrier interval, so users need to tune the barrier interval to achieve better freshness.
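A minimal sketch of the proposed per-barrier compaction, assuming a simplified `(Op, key, row)` changelog representation (the types and function here are illustrative, not the actual sink executor API):

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Changelog operations, mirroring Insert/Delete/UpdateDelete/UpdateInsert.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Op { Insert, Delete, UpdateDelete, UpdateInsert }

/// Compact the changes buffered within one barrier by the sink's key:
/// for each key only the net effect survives, so intermediate states
/// produced and retracted inside the same barrier are never emitted.
/// Last-write-wins per key matches upsert-sink semantics, where a
/// Delete acts as a tombstone for the key.
fn compact_per_barrier<K: Hash + Eq, R>(buffered: Vec<(Op, K, R)>) -> Vec<(Op, K, R)> {
    let mut latest: HashMap<K, (Op, R)> = HashMap::new();
    for (op, key, row) in buffered {
        let folded = match op {
            Op::Insert | Op::UpdateInsert => (Op::Insert, row),
            Op::Delete | Op::UpdateDelete => (Op::Delete, row),
        };
        latest.insert(key, folded);
    }
    latest.into_iter().map(|(key, (op, row))| (op, key, row)).collect()
}
```

After compaction there is at most one change per key per barrier, which is what removes the intermediate results from the sink output.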

FORCE APPEND ONLY SINK

Thanks @xiangjinwu #9443 (comment). The primary key defined on an MQ is actually the partition key, and the MQ can maintain the order of events with the same key.
For a normal append-only sink, RW never retracts records, and records can be reordered anyway under SQL semantics. But we also have a widely used "force append only" semantic that converts a retractable stream into an append-only stream by keeping only the insert operations. Under this definition, the order of events with the same key matters (see the sketch below).
So the per-barrier compaction will also be done on the force-append-only sink.
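As a minimal sketch of the "force append only" conversion described above (illustrative types, not RisingWave's implementation):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Op { Insert, Delete, UpdateDelete, UpdateInsert }

/// "Force append only": keep only the insert side of a retractable
/// changelog, dropping Delete and UpdateDelete. Because downstream can
/// no longer pair retractions with inserts, the relative order of the
/// surviving events with the same key is the only thing telling the
/// consumer which value is newest, which is why order matters here.
fn force_append_only<R>(changelog: Vec<(Op, R)>) -> Vec<(Op, R)> {
    changelog
        .into_iter()
        .filter(|(op, _)| matches!(op, Op::Insert | Op::UpdateInsert))
        .map(|(_, row)| (Op::Insert, row))
        .collect()
}
```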

@st1page st1page self-assigned this Nov 14, 2023
@github-actions github-actions bot added this to the release-1.5 milestone Nov 14, 2023
@fuyufjh
Member

fuyufjh commented Nov 14, 2023

The motivations didn't fully convince me.

Some users find it confusing when RW sinks unnecessary intermediate results.

For these cases, I think

  • If the intermediate results come from RisingWave, we should try to eliminate this in executors.
  • If the intermediate results come from the user's upstream, it's okay to output them.

"Compact per barrier" was proposed to solve the "sink key & stream key mismatch" problem. Here it sounds like an abuse of it.

@st1page
Contributor Author

st1page commented Nov 14, 2023

If the intermediate results come from RisingWave, we should try to eliminate this in executors.

Some intermediate results are produced by the executors' efficient computation itself, such as projectSet.

@hzxa21
Collaborator

hzxa21 commented Nov 15, 2023

Is it possible to do the compaction on the fly during project and projectset? I imagine that only the adjacent items with the same key need to be checked and compacted. Will the overhead be significant?

@st1page
Contributor Author

st1page commented Nov 15, 2023

Is it possible to do the compaction on the fly during project and projectset? I imagine that only the adjacent items with the same key need to be checked and compacted. Will the overhead be significant?

Yes, we can do that in some places as an optimization; see #13409. A sketch follows.
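One plausible interpretation of that on-the-fly compaction, as a hedged sketch (illustrative types; it treats an UpdateDelete immediately followed by an identical UpdateInsert as a droppable no-op):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Op { Insert, Delete, UpdateDelete, UpdateInsert }

/// One-pass, on-the-fly compaction that only inspects adjacent rows:
/// an UpdateDelete immediately followed by an UpdateInsert carrying an
/// identical row is a no-op update, and both halves can be dropped.
/// The overhead is a single comparison per row.
fn compact_adjacent<R: PartialEq>(chunk: Vec<(Op, R)>) -> Vec<(Op, R)> {
    let mut out: Vec<(Op, R)> = Vec::with_capacity(chunk.len());
    for (op, row) in chunk {
        if op == Op::UpdateInsert {
            if let Some((Op::UpdateDelete, prev)) = out.last() {
                if *prev == row {
                    // The update changed nothing: drop the pair.
                    out.pop();
                    continue;
                }
            }
        }
        out.push((op, row));
    }
    out
}
```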

But the problem here is that, in some cases, it is difficult to keep the UpdateDelete and UpdateInsert adjacent.

e.g. 1: updating a join key or an over-window partition key: the old value and the new value could be shuffled into different partitions (see the sketch below).
e.g. 2: vectorized expand/projectset. #4894 (comment)
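To make e.g. 1 concrete, here is a toy sketch of how hash shuffling splits the update pair; the hashing here is a stand-in for the real exchange operator, not actual RisingWave code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash-shuffle routing, standing in for the exchange operator.
fn partition_of<K: Hash>(key: &K, num_partitions: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % num_partitions
}

fn main() {
    // An UPDATE that changes the join key from "a" to "b" emits:
    //   (UpdateDelete, join_key = "a")  routed by the old key
    //   (UpdateInsert, join_key = "b")  routed by the new key
    let p_old = partition_of(&"a", 4);
    let p_new = partition_of(&"b", 4);
    println!("UpdateDelete -> partition {p_old}, UpdateInsert -> partition {p_new}");
    // Whenever p_old != p_new, the two halves of the update land on
    // different partitions and their adjacency is lost downstream.
}
```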

@fuyufjh
Member

fuyufjh commented Nov 15, 2023

If the intermediate results come from RisingWave, we should try to eliminate this in executors.

Some intermediate results are produced by the executors' efficient computation itself, such as projectSet.

Based on our previous discussion, my understanding is that we finally decided to choose Option 2 of the two options below:

  1. Completely remove UpdateDelete & UpdateInsert and never rely on them anymore.
  2. Keep UpdateDelete & UpdateInsert and rely on them to identify updates as much as possible; this is not 100% guaranteed, but best-effort.

Today I am just trying to follow Option 2.

As for the exceptions you mentioned:

e.g. 1: updating a join key or an over-window partition key: the old value and the new value could be shuffled into different partitions.
e.g. 2: vectorized expand/projectset. #4894 (comment)

1 is inevitable. 2 is a trade-off for us.

@fuyufjh
Member

fuyufjh commented Nov 15, 2023

I understand that you want to solve the issue completely. The problem is, IMO, that the proposed solution isn't free, and the cost seems even higher than the benefit.

@st1page st1page modified the milestones: release-1.5, release-1.6 Dec 5, 2023
@st1page st1page closed this as not planned Jan 9, 2024