[close #772] Fix flaky integration tests #770

Merged 11 commits on Oct 7, 2023

@pingyu (Contributor) commented Sep 24, 2023

What problem does this PR solve?

Issue Number: close #772

Problem Description: Integration tests in CI are flaky

What is changed and how does it work?

The integration test failures in CI appear to be caused by TiKV running out of memory (OOM).

PD responds with the error "cluster is not bootstrapped" (see "Run Integration Test" in https://github.com/tikv/client-java/actions/runs/6390085322/job/17342615588?pr=770):

11949 [main] WARN  org.tikv.common.util.ConcreteBackOffer  - BackOffer.maxSleep 5000ms is exceeded, errors:
11949 [main] WARN  org.tikv.common.util.ConcreteBackOffer  - 
11.org.tikv.common.exception.GrpcException: 
ErrorType: PD_ERROR
Error: type: NOT_BOOTSTRAPPED
message: "cluster is not bootstrapped"

12.org.tikv.common.exception.GrpcException: 
ErrorType: PD_ERROR
Error: type: NOT_BOOTSTRAPPED
message: "cluster is not bootstrapped"

13.org.tikv.common.exception.GrpcException: 
ErrorType: PD_ERROR
Error: type: NOT_BOOTSTRAPPED
message: "cluster is not bootstrapped"
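
As a rough illustration of how a log like the one above can arise, here is a minimal Python sketch of a retry loop with a total sleep budget. It is not the actual `ConcreteBackOffer` implementation in client-java; the names `call_with_backoff` and `BackoffExceeded` and the delay parameters are hypothetical. Only the 5000 ms budget comes from the log.

```python
import time


class BackoffExceeded(Exception):
    """Raised when the total sleep budget is used up; keeps every error seen."""

    def __init__(self, max_sleep_ms, errors):
        super().__init__(f"maxSleep {max_sleep_ms}ms is exceeded, {len(errors)} errors")
        self.errors = errors


def call_with_backoff(fn, max_sleep_ms=5000, base_ms=100, cap_ms=1000, sleep=time.sleep):
    """Call fn until it succeeds or the cumulative sleep would exceed max_sleep_ms.

    base_ms and cap_ms are assumed values for illustration only.
    """
    errors = []
    total_ms = 0
    delay_ms = base_ms
    while True:
        try:
            return fn()
        except Exception as exc:
            # Record each failure so all of them can be reported together.
            errors.append(exc)
        if total_ms + delay_ms > max_sleep_ms:
            raise BackoffExceeded(max_sleep_ms, errors)
        sleep(delay_ms / 1000)
        total_ms += delay_ms
        # Exponential growth, capped so a single wait never gets too long.
        delay_ms = min(delay_ms * 2, cap_ms)
```

If PD never becomes bootstrapped (for example because TiKV has died), every attempt fails, the budget runs out, and the accumulated errors are reported in one batch, which matches the shape of the numbered error list in the log.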

After adding a step that prints the TiKV logs on failure, we found that TiKV logged nothing more after printing "using config" (see "Print TiKV logs" in https://github.com/tikv/client-java/actions/runs/6390085322/job/17342615588?pr=770). It was very likely OOM-killed.

Containers in the GitHub Actions runner have only 7 GB of memory (see https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners). I think that is not enough for 2 PD and 2 TiKV instances, especially for newer versions.

To address the issue, this PR sets up only one cluster in CI and enables APIv2 so that it accepts both RawKV and TxnKV test cases.

At the same time, it keeps 2 clusters for older versions to cover the APIv1 and APIv1 TTL scenarios.

Changes:

  • Add a new workflow for TiKV versions v6.5.3, v7.1.1 and nightly, which sets up only one cluster and enables APIv2.
  • Optimize the sleep after cluster startup by waiting for the message "PD Endpoints" (see https://github.com/pingcap/tiup/blob/v1.13.1/components/playground/playground.go#L1108). P.S. Maybe we should add a feature to make this message stable.
  • Reduce raftstore.capacity, as the GitHub Actions runner has only 14 GB of disk.
  • Limit storage.block-cache.capacity to reduce the chance of being OOM-killed.
  • Set the TiKV port explicitly by passing the --kv.port xxx argument to TiUP. Otherwise, the TiKV instance of txnkv complains that the port is occupied ([2023/10/03 09:02:28.144 +00:00] [FATAL] [server.rs:1165] ["127.0.0.1_20160 already in use, maybe another instance is binding with this address."], see https://github.com/tikv/client-java/actions/runs/6390868184/job/17345000675?pr=770). This may be a TiUP issue, as it should pick a free port, but this PR just works around it.
  • Print TiKV logs on failure for troubleshooting.
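
The idea of waiting for the "PD Endpoints" message instead of sleeping for a fixed time can be sketched roughly as follows. This is a Python illustration only, not the script used in this PR's workflow; the function name `wait_for_marker`, the log path, and the timeout values are assumptions.

```python
import time
from pathlib import Path


def wait_for_marker(log_path, marker="PD Endpoints", timeout=60.0, interval=0.5):
    """Poll log_path until marker appears; return False if timeout seconds pass first."""
    deadline = time.monotonic() + timeout
    path = Path(log_path)
    while True:
        # The log file may not exist yet while the playground is still starting.
        if path.exists() and marker in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A CI step could call `wait_for_marker("playground.log")` (a hypothetical file that captures the TiUP playground output) and fail fast when it returns `False`, rather than always paying the cost of a worst-case sleep.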

Code changes

  • No code

Check List for Tests

This PR has been tested by at least one of the following methods:

  • Integration test

Side effects

  • NO side effects

Related changes

  • Need to cherry-pick to the release branch

Signed-off-by: Ping Yu <[email protected]>

codecov bot commented Oct 3, 2023

Codecov Report

All modified lines are covered by tests ✅

see 9 files with indirect coverage changes


pingyu added 2 commits October 3, 2023 17:18
Signed-off-by: Ping Yu <[email protected]>
Signed-off-by: Ping Yu <[email protected]>
@pingyu changed the title from "[close #xxx] Fix flaky integration tests" to "[close #772] Fix flaky integration tests" on Oct 3, 2023
Signed-off-by: Ping Yu <[email protected]>
@pingyu pingyu requested a review from iosmanthus October 3, 2023 10:45
@iosmanthus iosmanthus merged commit 3b503fb into tikv:master Oct 7, 2023
16 checks passed
Development

Successfully merging this pull request may close these issues.

tests: Integration tests in CI are flaky
2 participants