Store: Add flag ignore-deletion-marks-errors to be able to ignore errors while retrieving deletion marks #7013

Open
wants to merge 9 commits into
base: main

@pettersolberg88 pettersolberg88 commented Dec 29, 2023

Store: Add flag ignore-deletion-marks-errors to be able to ignore errors while retrieving deletion marks.

Our S3 implementation (NetApp) has intermittent faults that cause timeouts when querying some non-existent objects.
The IgnoreDeletionMarkFilter queries every metrics block for the file deletion-mark.json, and when store receives a timeout or other error, it crashes. This flag makes store ignore all errors from fetching deletion marks, so it does not crash.

Fixes errors like this:

{"caller":"grpc.go:164","component":"store","err":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HA07EAKT1YPCMYC6SDHS58S0/deletion-mark.json: Get \"https:<S3-URL>/thanos-metrics/01HA07EAKT1YPCMYC6SDHS58S0/deletion-mark.json\": dial tcp <IP-address>:443: i/o timeout","level":"info","msg":"internal server is shutdown gracefully","service":"gRPC/server","ts":"2023-10-31T07:06:03.987065433Z"}

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

  • Add flag ignore-deletion-marks-errors to be able to ignore errors while retrieving deletion marks.
  • Log other errors while processing deletion-marks.

Verification

  • Tested against our S3 implementation. Store does not crash.

…ors while retrieving deletion marks

Signed-off-by: Petter Solberg <[email protected]>
@pull-request-size pull-request-size bot added size/M and removed size/L labels Dec 29, 2023
@yeya24
Contributor

yeya24 commented Jan 3, 2024

The IgnoreDeletionMarkFilter queries all metrics blocks for the file deletion-mark.json and when store receives an timeout or other error, it crashes.

This doesn't sound right. An I/O timeout shouldn't crash store.

@pettersolberg88
Author

pettersolberg88 commented Jan 4, 2024

I agree that an I/O timeout should not crash Thanos store. It seems to me that the error handling does not handle timeouts properly.

Here is a complete log from thanos-store, which is currently crashlooping. We are running two replicas on v0.32.5, and both are crashlooping. The workaround is to delete the whole chunk.

k logs thanos-store-cold-0 -c thanos-store -p
{"caller":"factory.go:53","level":"info","msg":"loading bucket configuration","ts":"2024-01-04T05:46:42.136611622Z"}
{"caller":"factory.go:35","level":"info","msg":"loading index cache configuration","ts":"2024-01-04T05:46:42.1370288Z"}
{"caller":"memcached.go:71","level":"info","msg":"created index cache","ts":"2024-01-04T05:46:42.137888234Z"}
{"caller":"options.go:26","level":"info","msg":"disabled TLS, key and cert must be set to enable","protocol":"gRPC","ts":"2024-01-04T05:46:42.13830578Z"}
{"caller":"store.go:520","level":"info","msg":"starting store node","ts":"2024-01-04T05:46:42.139639595Z"}
{"caller":"intrumentation.go:75","level":"info","msg":"changing probe status","status":"healthy","ts":"2024-01-04T05:46:42.139735228Z"}
{"address":"0.0.0.0:10902","caller":"http.go:73","component":"store","level":"info","msg":"listening for requests and metrics","service":"http/server","ts":"2024-01-04T05:46:42.139776183Z"}
{"caller":"store.go:418","level":"info","msg":"initializing bucket store","ts":"2024-01-04T05:46:42.139788709Z"}
{"address":"[::]:10902","caller":"tls_config.go:274","component":"store","level":"info","msg":"Listening on","service":"http/server","ts":"2024-01-04T05:46:42.139967167Z"}
{"address":"[::]:10902","caller":"tls_config.go:277","component":"store","http2":false,"level":"info","msg":"TLS is disabled.","service":"http/server","ts":"2024-01-04T05:46:42.139987835Z"}
{"caller":"intrumentation.go:67","level":"warn","msg":"changing probe status","reason":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","status":"not-ready","ts":"2024-01-04T07:02:52.44066162Z"}
{"caller":"http.go:91","component":"store","err":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","level":"info","msg":"internal server is shutting down","service":"http/server","ts":"2024-01-04T07:02:52.440757359Z"}
{"caller":"intrumentation.go:56","level":"info","msg":"changing probe status","status":"ready","ts":"2024-01-04T07:02:52.440804592Z"}
{"address":"0.0.0.0:10901","caller":"grpc.go:131","component":"store","level":"info","msg":"listening for serving gRPC","service":"gRPC/server","ts":"2024-01-04T07:02:52.440867161Z"}
{"caller":"http.go:110","component":"store","err":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","level":"info","msg":"internal server is shutdown gracefully","service":"http/server","ts":"2024-01-04T07:02:52.44091739Z"}
{"caller":"intrumentation.go:81","level":"info","msg":"changing probe status","reason":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","status":"not-healthy","ts":"2024-01-04T07:02:52.440966495Z"}
{"caller":"intrumentation.go:67","level":"warn","msg":"changing probe status","reason":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","status":"not-ready","ts":"2024-01-04T07:02:52.441015326Z"}
{"caller":"grpc.go:138","component":"store","err":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","level":"info","msg":"internal server is shutting down","service":"gRPC/server","ts":"2024-01-04T07:02:52.441033605Z"}
{"caller":"grpc.go:151","component":"store","level":"info","msg":"gracefully stopping internal server","service":"gRPC/server","ts":"2024-01-04T07:02:52.44105544Z"}
{"caller":"grpc.go:164","component":"store","err":"bucket store initial sync: sync block: filter metas: filter blocks marked for deletion: get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json: Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <ip>:443: i/o timeout","level":"info","msg":"internal server is shutdown gracefully","service":"gRPC/server","ts":"2024-01-04T07:02:52.441094163Z"}
{"caller":"main.go:161","err":"Get \"https://<S3-URL>/thanos-metrics/01HH319ACH6FYV9Q28010T93YB/deletion-mark.json\": dial tcp <IP-address-to-S3-provider>:443: i/o timeout get file: 01HH319ACH6FYV9Q28010T93YB/deletion-mark.json github.com/thanos-io/thanos/pkg/block/metadata.ReadMarker \t/app/pkg/block/metadata/markers.go:124 github.com/thanos-io/thanos/pkg/block.(*IgnoreDeletionMarkFilter).Filter.func1 \t/app/pkg/block/fetcher.go:859 golang.org/x/sync/errgroup.(*Group).Go.func1 \t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650 filter blocks marked for deletion github.com/thanos-io/thanos/pkg/block.(*IgnoreDeletionMarkFilter).Filter \t/app/pkg/block/fetcher.go:904 github.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch \t/app/pkg/block/fetcher.go:475 github.com/thanos-io/thanos/pkg/block.(*MetaFetcher).Fetch \t/app/pkg/block/fetcher.go:514 github.com/thanos-io/thanos/pkg/store.(*BucketStore).SyncBlocks \t/app/pkg/store/bucket.go:556 github.com/thanos-io/thanos/pkg/store.(*BucketStore).InitialSync \t/app/pkg/store/bucket.go:625 main.runStore.func5.1 \t/app/cmd/thanos/store.go:427 github.com/thanos-io/thanos/pkg/runutil.RetryWithLog \t/app/pkg/runutil/runutil.go:97 github.com/thanos-io/thanos/pkg/runutil.Retry \t/app/pkg/runutil/runutil.go:87 main.runStore.func5 \t/app/cmd/thanos/store.go:426 github.com/oklog/run.(*Group).Run.func1 \t/go/pkg/mod/github.com/oklog/[email protected]/group.go:38 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650 filter metas github.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch \t/app/pkg/block/fetcher.go:476 github.com/thanos-io/thanos/pkg/block.(*MetaFetcher).Fetch \t/app/pkg/block/fetcher.go:514 github.com/thanos-io/thanos/pkg/store.(*BucketStore).SyncBlocks \t/app/pkg/store/bucket.go:556 github.com/thanos-io/thanos/pkg/store.(*BucketStore).InitialSync \t/app/pkg/store/bucket.go:625 main.runStore.func5.1 \t/app/cmd/thanos/store.go:427 github.com/thanos-io/thanos/pkg/runutil.RetryWithLog \t/app/pkg/runutil/runutil.go:97 github.com/thanos-io/thanos/pkg/runutil.Retry \t/app/pkg/runutil/runutil.go:87 main.runStore.func5 \t/app/cmd/thanos/store.go:426 github.com/oklog/run.(*Group).Run.func1 \t/go/pkg/mod/github.com/oklog/[email protected]/group.go:38 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650 sync block github.com/thanos-io/thanos/pkg/store.(*BucketStore).InitialSync \t/app/pkg/store/bucket.go:626 main.runStore.func5.1 \t/app/cmd/thanos/store.go:427 github.com/thanos-io/thanos/pkg/runutil.RetryWithLog \t/app/pkg/runutil/runutil.go:97 github.com/thanos-io/thanos/pkg/runutil.Retry \t/app/pkg/runutil/runutil.go:87 main.runStore.func5 \t/app/cmd/thanos/store.go:426 github.com/oklog/run.(*Group).Run.func1 \t/go/pkg/mod/github.com/oklog/[email protected]/group.go:38 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650 bucket store initial sync main.runStore.func5 \t/app/cmd/thanos/store.go:432 github.com/oklog/run.(*Group).Run.func1 \t/go/pkg/mod/github.com/oklog/[email protected]/group.go:38 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650 store command failed main.main \t/app/cmd/thanos/main.go:161 runtime.main \t/usr/local/go/src/runtime/proc.go:267 runtime.goexit \t/usr/local/go/src/runtime/asm_amd64.s:1650","level":"error","ts":"2024-01-04T07:02:52.441344799Z"}
