[aws] [s3] Introduce ignore_older & start_timestamp for S3 input allowing better registry cleanups #41817

Draft
wants to merge 8 commits into
base: main
from

Conversation

Kavindu-Dodan
Contributor

@Kavindu-Dodan Kavindu-Dodan commented Nov 27, 2024

Proposed commit message

Introduce ignore_older and start_timestamp properties to AWS S3 input. This is a follow-up for #41694.

The configurations introduced here act as input object filters. If an object fails to match the derived filters, its entry is cleaned up from the registry, which reduces filebeat memory consumption.

The introduced configurations are:

  • ignore_older: A time duration (look-back period); only objects modified within it are accepted for processing
  • start_timestamp: A timestamp from which objects are accepted for processing

For both options, the object's last modified timestamp is used for the comparison. See the Use cases section for further explanation.

Note: this PR is a follow-up to #41694, so the diff also contains all changes from that PR.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

None, as both options are disabled by default. However, when the configurations introduced here are used, the following can affect the user:

  • When start_timestamp is defined, objects whose last modified timestamp is earlier than start_timestamp are not processed (documented 1)
  • When ignore_older is defined, objects that do not fall within the look-back period, measured from the start of the polling run, are ignored (documented 1)
  • When both start_timestamp & ignore_older are defined, the initial run will process all entries back to start_timestamp. Subsequent runs will not include entries that fall outside ignore_older, even if processing failed for an object. (documented 1)

How to test this PR locally

  • Build filebeat from the changes included in this PR
  • Prepare a source S3 bucket with objects (you may use this tool 2 to create entries)
  • Configure filebeat with different values for ignore_older & start_timestamp to see how data ingestion changes with them. See the Use cases section for further explanation.

Related issues

Use cases

Consider the diagrams below, where there are 3 objects, Object A, Object B and Object C, with their respective last modified timestamps t1, t2 and t3.

Then consider how filebeat processes objects and tracks registry entries in the following scenarios.

Default behavior

If neither configuration is used, filebeat will process all objects, and the internal registry will keep tracking them unless they are removed from the bucket.

image

Use start_timestamp

If start_timestamp is used, then objects newer than the timestamp are accepted for processing. The registry will grow unless objects are removed from the bucket.

image

Use ignore_older

If ignore_older is defined, input will process objects within the provided duration, calculated from the current time. The registry will track objects within the current timeframe and others will get cleaned up eventually by subsequent runs.

image

Use both ignore_older & start_timestamp

If both properties are defined,

  • The initial run will include entries that fall within the start_timestamp (ignoring ignore_older duration).
  • Subsequent runs will only consider entries that fall within the ignore_older duration.

image

Footnotes

  1. https://github.com/elastic/beats/pull/41817/files#diff-422765b7341c5bbf6de7af38927e34e00a5073b188585a7af3c4fee1175b64a6R574-R597

  2. https://github.com/Kavindu-Dodan/data-gen

@Kavindu-Dodan Kavindu-Dodan added enhancement Team:obs-ds-hosted-services Label for the Observability Hosted Services team backport-8.x Automated backport to the 8.x branch with mergify labels Nov 27, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Nov 27, 2024
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/s3-input-start-time-and-ignore-old branch 2 times, most recently from 4924d70 to 79ae2c1 Compare November 27, 2024 22:32
@@ -115,6 +115,7 @@ filebeat.inputs:
- Add support to source AWS cloudwatch logs from linked accounts. {pull}41188[41188]
- Jounrald input now supports filtering by facilities. {pull}41061[41061]
- Add support to include AWS cloudwatch linked accounts when using log_group_name_prefix to define log group names. {pull}41206[41206]
- AWS S3 input registry cleanup for untracked s3 objects. {pull}41694[41694]
Contributor Author

todo - fix this when #41694 is merged. This is here as changes are based on #41694

Signed-off-by: Kavindu Dodanduwa <[email protected]>

# Conflicts:
#	x-pack/filebeat/input/awss3/states.go
#	x-pack/filebeat/input/awss3/states_test.go
Signed-off-by: Kavindu Dodanduwa <[email protected]>
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/s3-input-start-time-and-ignore-old branch from 52fad61 to 6f5472c Compare December 3, 2024 23:06
@elastic elastic deleted a comment from mergify bot Dec 3, 2024
Labels
backport-8.x Automated backport to the 8.x branch with mergify enhancement Team:obs-ds-hosted-services Label for the Observability Hosted Services team
Development

Successfully merging this pull request may close these issues.

aws-s3 input's bucket polling accumulates state in the registry
1 participant