HDDS-11773. Prevent frequent DataNode Ratis snapshotting. #7473

Open
wants to merge 1 commit into
base: master

jojochuang
Contributor

What changes were proposed in this pull request?

HDDS-11773. Bump hdds.ratis.snapshot.threshold and hdds.container.ratis.statemachine.max.pending.apply-transactions to 100k

Please describe your PR in detail:

  • DataNode is configured to take a Ratis snapshot every 10k transactions, which is too frequent for HBase workloads, where a DataNode can process thousands, and eventually tens of thousands, of transactions per second.
  • Bump hdds.ratis.snapshot.threshold to 100k for now, and update hdds.container.ratis.statemachine.max.pending.apply-transactions to the same value to keep the two consistent.
  • We should revisit this and eventually raise it to 1M.
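
For clusters that want these values without waiting for the new defaults, the same change can be expressed as a site-level override. This is a minimal sketch, assuming the overrides are placed in the DataNode's ozone-site.xml; that placement is an assumption for illustration, since this PR itself only edits the default values.

```xml
<!-- Sketch of a site-level override. Placement in ozone-site.xml is an assumption. -->
<configuration>
  <property>
    <name>hdds.ratis.snapshot.threshold</name>
    <value>100000</value>
  </property>
  <property>
    <!-- Kept equal to the snapshot threshold, as described in this PR. -->
    <name>hdds.container.ratis.statemachine.max.pending.apply-transactions</name>
    <value>100000</value>
  </property>
</configuration>
```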

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11773

How was this patch tested?

Applied the change to an HBase cluster. Previously the DataNode was taking a snapshot every 4-5 seconds; now it takes one about every minute.
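
As a rough consistency check on those numbers, assuming snapshots are triggered only by the transaction threshold and the transaction rate is constant (both simplifying assumptions), the old 4-5 second interval implies a rate $r$ and a new interval $T$ of roughly:

$$
r \approx \frac{10\,000\ \text{tx}}{4.5\ \text{s}} \approx 2\,200\ \text{tx/s},
\qquad
T \approx \frac{100\,000\ \text{tx}}{2\,200\ \text{tx/s}} \approx 45\ \text{s}
$$

This is in line with the roughly one-minute interval observed after the change.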

Change-Id: I2baf863c537cc3f5b0e2905c2fb1ca88d05c0ff2
@jojochuang jojochuang changed the title from "HDDS-11773. Frequent DataNode Ratis snapshotting." to "HDDS-11773. Prevent frequent DataNode Ratis snapshotting." on Nov 22, 2024
Contributor

@smengcl smengcl left a comment

Thanks @jojochuang .

But this also changes OM's default. Could it be a concern if the Ratis snapshot interval is set too high for followers to catch up, causing OM bootstrapping to fail? Is there an existing mechanism to tune this for Datanodes only?

What do you think? @szetszwo

@jojochuang
Contributor Author

Isn't it DataNode only?
SCM uses ozone.scm.ha.ratis.snapshot.threshold.
I guess OM uses the default value, which is 400000.

@smengcl
Contributor

smengcl commented Nov 23, 2024

> isn't it DataNode only? SCM uses ozone.scm.ha.ratis.snapshot.threshold. I guess OM uses the default value which is 400000.

Right. Please amend the config tag.

@@ -279,15 +279,15 @@
</property>
<property>
<name>hdds.ratis.snapshot.threshold</name>
-<value>10000</value>
+<value>100000</value>
<tag>OZONE, RATIS</tag>
Contributor

@smengcl smengcl Nov 23, 2024

Suggested change
-<tag>OZONE, RATIS</tag>
+<tag>OZONE, CONTAINER, RATIS</tag>

Contributor

Actually, I'm not entirely sure whether it should be tagged CONTAINER or DATANODE.

<tag>OZONE, RATIS</tag>
<description>Number of transactions after which a ratis snapshot should be
taken.
</description>
</property>
<property>
<name>hdds.container.ratis.statemachine.max.pending.apply-transactions</name>
-<value>10000</value>
+<value>100000</value>
<tag>OZONE, RATIS</tag>
Contributor

@smengcl smengcl Nov 23, 2024

Suggested change
-<tag>OZONE, RATIS</tag>
+<tag>OZONE, CONTAINER, RATIS</tag>
