Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

blockbuilder: more tests #9314

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

blockbuilder: more tests #9314

wants to merge 2 commits into from

Conversation

narqo
Copy link
Contributor

@narqo narqo commented Sep 17, 2024

What this PR does

This one seats atop #9199 for now

This is part of #8635; refer to it for more details.

Here we backport the test cases for how block-builder handles the out-of-order samples.

Also, the PR fixes a flaky TestBlockBuilder_StartWithLookbackOnNoCommit test, by making sure the test waits for the correct outcome.

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@narqo narqo force-pushed the vldmr/bb-upstream-tests branch from 748286a to 19833e8 Compare September 26, 2024 20:50
@narqo narqo changed the title blockbuilder: test out-of-order samples and records blockbuilder: more tests Sep 26, 2024
@narqo narqo force-pushed the vldmr/bb-upstream-tests branch from 19833e8 to b0c8533 Compare September 27, 2024 09:38
@narqo narqo marked this pull request as ready for review September 27, 2024 09:40
@narqo narqo requested a review from a team as a code owner September 27, 2024 09:40
cortex_blockbuilder_consumer_lag_records{partition="0"} 0
cortex_blockbuilder_consumer_lag_records{partition="1"} 0
`), "cortex_blockbuilder_consumer_lag_records"))
}, 30*time.Second, 100*time.Millisecond)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: here (and other similar changes), it's fine to wait that long. These assertions are terminal, so the test should not process if the outcome doesn't "eventually" happen.

t.Run("future record", func(t *testing.T) {
// The sample from above which was in-order but the kafka record was in future
// should get consumed in this cycle. The other sample that is still in the future should not be consumed.
cycleEnd = cycleEnd.Add(cfg.ConsumeInterval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doing this ourselves skips testing the BB logic which does it. Don't we want to test that too?

Can we let the BB run until it has filled the bucket with some data (i.e. we expect 3 blocks; do something like Eventually(func() {bucket.countBlocks() == 3}) or until it updates some metrics. Then we can assert on what each block contains with tsdb.OpenBlock() instead of tsdb.Open

Thoughts?

Copy link
Contributor Author

@narqo narqo Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get the point, sorry. Here, what we do is, effectively, move the wall clock forward. It's not that we "skip" any of the block-builder's logic. We only point it to a portion of the partition, where the cycle's data represent what the test case tests (note that we explicitly trigger the nextConsumeCycle in these tests).

We would get the same if we started the block-builder and let it run for several hours. Over the course of multiple cycle hours, it'd scanned over all test data in the partition, and tested the blocks produced on each cycle. Without mocking block-builder's clock, that's not ideal, of course.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would get the same if we started the block-builder and let it run for several hours

yes, but we'd exercise the production code logic. Right now the logic for calculating cycleEnd is in both the tests and the prod code.

but your point about waiting for hours also makes sense. Maybe smaller blocks can solve this? like 2-3-second long blocks with a much smaller ConsumeInterval? A small test is better than no test, so i don't want to block on this

Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test, I like it! 👏 I left a couple of minor comments. No need for me to re-review it.

Comment on lines +211 to +212
require.Eventually(t, func() bool {
return assert.NoError(t, promtest.GatherAndCompare(reg, strings.NewReader(`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work as expected? If the assert.NoError() fail at least once, isn't it tracked anyway as a failure by the testing library? I'm wondering if that you really want is require.EventuallyWithT() which was designed for this specific use case.

return assert.NoError(t, promtest.GatherAndCompare(reg, strings.NewReader(`
# HELP cortex_blockbuilder_consumer_lag_records The per-topic-partition number of records, instance needs to work through each cycle.
# TYPE cortex_blockbuilder_consumer_lag_records gauge
cortex_blockbuilder_consumer_lag_records{partition="0"} 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We assert on cortex_blockbuilder_consumer_lag_records being 0 as a success condition to consider the block builder has done. Typically metrics are initialised by 0, so 0 could also mean "no cycle has started yet". It's not a real issue here because of a technicality: this metric is defined as prometheus.GaugeVec and they're not initialized with 0 by default (prometheus.Gauge is).

I'm wondering if when we start the block builder we should initialise the cortex_blockbuilder_consumer_lag_records metric for all owned partition to the value of -1 to clearly signal block building hasn't started yet.

@@ -287,6 +280,121 @@ func TestBlockBuilder_WithMultipleTenants(t *testing.T) {
}
}

func TestBlockBuilder_WithOutOfOrderRecordsAndSamples(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] This test assumes ConsumeInterval is 1h, which is the default, but it may change in the future. To keep this test stable, I would suggest to override ConsumeInterval to 1h in the code, and add a comment to explain that's an assumption of the test, regardless what the default value will be in the future.

Should we do the same for ConsumeIntervalBuffer?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants