perf: kafka source performance regression due to tracing #12959

Closed
lmatz opened this issue Oct 19, 2023 · 20 comments · Fixed by #13105
lmatz (Contributor) commented Oct 19, 2023

(screenshot: nightly performance chart)

To verify whether it is the environment's impact, we re-ran the nightly-20231001 image as the last data point; its performance is good.

The first drop happens between nightly-20231008 and nightly-20231009, and the second drop happens between nightly-20231012 and nightly-20231015.

Commits between 1008 and 1009:


## nightly-20231009
- 5ae9a360daeeefba400e44d066c76e2bc2526ff0 chore: Update Slack link to a permanent one in README (#12700) https://github.com/risingwavelabs/risingwave/pull/12700
- 2fef0544a7ec4c4fe9921e35f421793c6e68b29a fix: add "ts_ms" in source (#12703) https://github.com/risingwavelabs/risingwave/pull/12703
- 0c4013cc3a95caa465c51028c5d80c1f33ca413b chore(deps): bump etcd-client to v0.12.1 (#12705) https://github.com/risingwavelabs/risingwave/pull/12705
- 9db52aab732b4a851ec397e91c2182b7f5e58056 ci: fix disk out of space for GHA (attempt 2) (#12671) https://github.com/risingwavelabs/risingwave/pull/12671
- bfdafe1b65741b11c8d998b052c1ae1bbff697dd refactor(common): refine `Row` changes in #11070 (#12652) https://github.com/risingwavelabs/risingwave/pull/12652
- da5099bbcc1a26670d7e040dd1b65070f03e1652 perf(stream agg): flush dirty agg groups according to heap size (#12453) https://github.com/risingwavelabs/risingwave/pull/12453
- 2a7ef0d808f17a0e82ee72e6376e7f77c2ba07aa fix(log-store): handle unaligned epoch after recovery (#12407) https://github.com/risingwavelabs/risingwave/pull/12407
- 456f3e98d7a00e3c383f8b03ee40b5da09af58d9 chore(storage): upgrade write limit (#12658) https://github.com/risingwavelabs/risingwave/pull/12658
- 54e8c759a295ff9ac6888b34520bdb1c48a6c363 chore: Update Slack link in README.md (#12697) https://github.com/risingwavelabs/risingwave/pull/12697
- b4f357c22d825a6caad31011e718ab88c41f1bc2 feat(connector): structural logging for parsing error (#12659) https://github.com/risingwavelabs/risingwave/pull/12659
- fbd54d89b05a27f9f8da803e0e29d3b0c4b2e366 fix(storage): fix block format comment (#11664) https://github.com/risingwavelabs/risingwave/pull/11664
- da1e7bed1a6eaf2180ec4f5c57d189227ff08a99 chore: update cherry pick action version (#12692) https://github.com/risingwavelabs/risingwave/pull/12692
- 31cf3f8f1b827fad5572a6a188507a01f5b396ba fix(storage): fix compact_fast_runner_bytes unused (#12655) https://github.com/risingwavelabs/risingwave/pull/12655
- fcdc3a112bb64ac151d44cc48f0bb143d92d456d perf(expr): avoid parsing unit every time on `extract` and `date_part` (#12398) https://github.com/risingwavelabs/risingwave/pull/12398
- ca2509546d056c863752bd4b7568cf0aa5559fe6 chore(deps): Bump byteorder from 1.4.3 to 1.5.0 (#12666) https://github.com/risingwavelabs/risingwave/pull/12666
- a8b25ac3133c1f65bffa467bcfc884b949cca23d fix(doc): fix debian-like installation guide (#12684) https://github.com/risingwavelabs/risingwave/pull/12684
- 8b082dea901084c5f9a6462789e18f77dbe3efc0 chore(deps): Bump similar from 2.2.1 to 2.3.0 (#12668) https://github.com/risingwavelabs/risingwave/pull/12668

## nightly-20231008

We notice that q0 is a stateless query that does no computation and no state access, so it is likely an issue related to parsing, I guess?

lmatz added the help wanted, type/perf, and needs-investigation labels on Oct 19, 2023
The github-actions bot added this issue to the release-1.4 milestone on Oct 19, 2023
huangjw806 (Contributor) commented:

cc @st1page

st1page (Contributor) commented Oct 19, 2023

q0:
- nightly-20231008: compute CPU 620%~640%, higher source throughput; I guess it is source-bound.
- nightly-20231009: compute CPU 780%, lower source throughput; I guess it is CPU-bound.

So we need to check the CPU perf flame graph to find out which part consumes the CPU.

@st1page's comment was marked as resolved.

lmatz (Contributor, Author) commented Oct 19, 2023

(screenshots: SCR-20231019-naq, SCR-20231019-nau)

do_action +6%

st1page (Contributor) commented Oct 19, 2023

In summary, the RisingWave runtime CPU percentage decreases (68.58% -> 62.54%) and the RDKafka runtime CPU percentage increases (14.03% -> 17.04%) 🤔

@BugenZhao's comment was marked as resolved.

fuyufjh (Member) commented Oct 23, 2023

(attachment: flamegraph_.zip)

xxchan self-assigned this on Oct 23, 2023
lmatz (Contributor, Author) commented Oct 23, 2023

The flamegraph can be downloaded or directly examined in the Buildkite URL, under the artifacts tab.

@xxchan's comment was marked as resolved.

xxchan (Member) commented Oct 24, 2023

Just confirmed that the second drop is caused by the rdkafka bump.

might take a look at https://github.com/confluentinc/librdkafka/blob/master/CHANGELOG.md
and https://github.com/fede1024/rust-rdkafka/blob/master/changelog.md

387d251's performance is better

note:
## nightly-20231015
- b441341d3ced9cb155de2a1b9a5dabe67a9f7c2b chore(deps): bump rdkafka to v0.34.0 and librdkafka to v2.2.0 (#11232) https://github.com/risingwavelabs/risingwave/pull/11232
- 387d251b722a65dd7147b04aff611a2381c44bf4 feat(optimizer): optimize join stream key (#12831) https://github.com/risingwavelabs/risingwave/pull/12831
- ...

Why does upgrading rdkafka lower CPU usage, and also lower throughput?


xxchan (Member) commented Oct 24, 2023

First regression is #12659 cc @BugenZhao

The 2 points are before/after #12659:

(screenshots: performance chart; CPU)

BugenZhao (Member) commented:

> First regression is #12659 cc @BugenZhao

But why does the flame graph show that RisingWave takes less CPU in 1009? 😕

xxchan (Member) commented Oct 25, 2023

I tried to revert #12659 on latest main (https://github.com/risingwavelabs/risingwave/tree/xxchan/revert-12659) in order to see the rdkafka bump's impact on its own, but it seems this also cancelled out the impact of the rdkafka bump. 🤡


BugenZhao changed the title from "perf: performance regression" to "perf: kafka source performance regression" on Oct 26, 2023
BugenZhao (Member) commented Oct 26, 2023

I cannot observe anything useful from the flame graphs provided above; perhaps the files for 10/08 and 10/09 are incorrect?

However, in #13073 I can clearly see that the `tracing::info_span!` introduced in #12659 leads to an overhead of 30% (😨).

```rust
let parse_span = tracing::info_span!(
    "parse_one",
    split_id = msg.split_id.as_ref(),
    offset = msg.offset
);
```

(flame graph: 27% parse, 19% chunk building, 30% tracing)

Why it takes so much CPU time:

The `fmt` layer in `tracing-subscriber` eagerly formats the given fields on span creation, possibly because it only receives references to the fields. However, the span is only used to provide more context when a parsing error occurs, which means that most of the time this work is wasted.

https://docs.rs/tracing-subscriber/0.3.17/src/tracing_subscriber/fmt/fmt_layer.rs.html#797-812
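
For illustration, here is a minimal sketch (assuming the `tracing` and `tracing-subscriber` crates; not necessarily what the eventual fix in #13105 does) of one way to avoid paying this cost on the hot path: skip the per-message span entirely and emit the split/offset context only on the rare error path, so the field formatting happens only when a parse actually fails. `parse_bytes`, `parse_one`, `split_id`, and `offset` are hypothetical stand-ins for the real parser code.

```rust
// Sketch only: format the source context lazily, on the error path.
fn parse_bytes(payload: &[u8]) -> Result<usize, std::str::Utf8Error> {
    // Stand-in for the real message parser.
    std::str::from_utf8(payload).map(|s| s.len())
}

fn parse_one(split_id: &str, offset: &str, payload: &[u8]) -> Option<usize> {
    match parse_bytes(payload) {
        Ok(n) => Some(n),
        Err(error) => {
            // The split/offset fields are formatted only here, when parsing fails,
            // instead of eagerly for every message via `info_span!`.
            tracing::warn!(split_id, offset, error = %error, "failed to parse message");
            None
        }
    }
}

fn main() {
    tracing_subscriber::fmt::init();
    assert_eq!(parse_one("0", "42", b"hello"), Some(5));
    parse_one("0", "43", &[0xff, 0xfe]); // logs a warning with full context
}
```

An alternative along the same lines would be to keep the span but gate its creation (e.g. on `tracing::enabled!`) or record the fields lazily, so that the common success path does no formatting work.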

xxchan (Member) commented Oct 26, 2023

> I cannot observe anything useful from the flame graphs provided above; perhaps the files for 10/08 and 10/09 are incorrect?

Just noticed that the CPU flamegraphs are actually swapped.

(flame graph screenshots)

1008 (actually 1009): tracing takes more CPU, and it actually has more samples (rdkafka's samples unchanged, rw's samples increased).

1009 (actually 1008)

xxchan changed the title from "perf: kafka source performance regression" to "perf: kafka source performance regression due to tracing" on Oct 26, 2023
BugenZhao (Member) commented:

> perf: kafka source performance regression due to tracing

So what about #11232?

xxchan (Member) commented Oct 26, 2023

> perf: kafka source performance regression due to tracing
>
> So what about #11232?

According to

> I tried to revert #12659 on latest main (xxchan/revert-12659) in order to see the rdkafka bump's impact on its own, but it seems this also cancelled out the impact of the rdkafka bump. 🤡

I'm not sure whether we need to investigate rdkafka further. 🤡

Let's try to fix tracing first and see.
