[processor/tailsampling] add low-frequency spans sampling policy #36487

Open
wants to merge 3 commits into main

Conversation

larsn777
Contributor

Description

Hello,
We occasionally encounter the need to sample low-frequency spans. However, as far as I can tell, this type of sampling has not yet been implemented in the tail sampling processor.
This PR adds a new sampling policy that lets us sample low-frequency spans with a configurable level of accuracy.

Main approaches and implementation details:

  1. Span Identifier. To measure how unique or frequent a given span is within a set, we need to choose a span identifier. I chose the pair of service name + span name. In the future, the fields that make up this identifier could be made configurable within the sampling policy.
  2. Count-Min Sketch (CMS). To count occurrences of unique spans in the data stream, we need an appropriate data structure. A naive solution would be a hash table keyed by the span identifier with an occurrence counter as the value; however, with high identifier cardinality such a table could grow too large. Instead I chose the Count-Min Sketch, which generally needs far less memory for counting span identifiers. Users can tune the policy's configuration parameters to trade counting accuracy against resource usage. (A minimal sketch of the identifier and CMS is shown right after this list.)
  3. Sliding CMS Window. Counting low-frequency spans should happen within a limited time frame (e.g., an hour, half an hour, or a minute); otherwise we would consume significant resources and the CMS would lose accuracy over time. A simple approach would be to create a new CMS at the start of each interval, populate it with data, and start over with a fresh CMS at the next interval. However, resetting the counters at every interval boundary would cause a surge of sampled spans at the beginning of each interval. To avoid this, we use a sliding CMS window: the entire interval (window) is divided into N equal sub-intervals, each with its own CMS instance. As the window slides forward by one sub-interval, the CMS for the oldest sub-interval is dropped and a new one is created. Each new occurrence is added to the CMS of the latest sub-interval, and the total count for a span is the sum of its counts across all sub-intervals. This approach significantly smooths out the number of spans sampled at interval boundaries. (See the second sketch after this list.)
  4. Limits on Processing and Sampling. As mentioned in the previous point, the sliding window smooths out spikes in sampled spans at interval boundaries but does not eliminate them entirely, since a large share of unique spans will likely fall into the first sub-interval, causing smaller spikes at sub-interval boundaries. To cap these spikes, we can limit the number of spans processed and sampled per second.
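
To make points 1 and 2 concrete, here is a minimal Go sketch of the span identifier and a Count-Min Sketch counter. This is not the PR's actual code: the package name, type names, and the FNV-based hashing are assumptions for illustration only.

```go
package sampling // illustrative package name, not necessarily the PR's

import "hash/fnv"

// spanKey builds the identifier used for frequency counting:
// the pair of service name and span name (point 1).
func spanKey(serviceName, spanName string) string {
	return serviceName + "/" + spanName
}

// countMinSketch is a depth x width matrix of counters; each row uses an
// independently seeded hash, and the estimate for a key is the minimum
// counter across rows (an upper bound on the true count).
type countMinSketch struct {
	depth, width uint32
	counters     [][]uint32
}

func newCountMinSketch(depth, width uint32) *countMinSketch {
	c := &countMinSketch{depth: depth, width: width, counters: make([][]uint32, depth)}
	for i := range c.counters {
		c.counters[i] = make([]uint32, width)
	}
	return c
}

// bucket derives a per-row index by prefixing the key with the row number
// before hashing it with FNV-1a.
func (c *countMinSketch) bucket(row uint32, key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte{byte(row)})
	h.Write([]byte(key))
	return h.Sum32() % c.width
}

// Insert increments the counters for key in every row.
func (c *countMinSketch) Insert(key string) {
	for row := uint32(0); row < c.depth; row++ {
		c.counters[row][c.bucket(row, key)]++
	}
}

// Estimate returns the minimum counter for key across all rows.
func (c *countMinSketch) Estimate(key string) uint32 {
	min := ^uint32(0)
	for row := uint32(0); row < c.depth; row++ {
		if v := c.counters[row][c.bucket(row, key)]; v < min {
			min = v
		}
	}
	return min
}
```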

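And a sketch of the sliding CMS window with a per-second processing cap (points 3 and 4), building on the countMinSketch above. Again, this is an assumption-laden illustration rather than the PR's implementation; the token-bucket limiter from golang.org/x/time/rate is just one way to express the per-second limit.

```go
package sampling

import (
	"time"

	"golang.org/x/time/rate"
)

// slidingCMSWindow keeps one countMinSketch per sub-interval. When the
// window advances, the oldest sketch is dropped and a fresh one appended,
// so counts decay gradually instead of resetting all at once.
type slidingCMSWindow struct {
	sketches     []*countMinSketch
	subInterval  time.Duration
	lastAdvance  time.Time
	depth, width uint32
	limiter      *rate.Limiter // caps spans processed per second (point 4)
}

func newSlidingCMSWindow(window, subInterval time.Duration, depth, width uint32, spansPerSecond int) *slidingCMSWindow {
	n := int(window / subInterval)
	w := &slidingCMSWindow{
		subInterval: subInterval,
		lastAdvance: time.Now(),
		depth:       depth,
		width:       width,
		limiter:     rate.NewLimiter(rate.Limit(spansPerSecond), spansPerSecond),
	}
	for i := 0; i < n; i++ {
		w.sketches = append(w.sketches, newCountMinSketch(depth, width))
	}
	return w
}

// Observe records one occurrence of key and returns its estimated count
// over the whole window; ok is false when the per-second limit is hit.
func (w *slidingCMSWindow) Observe(key string) (count uint32, ok bool) {
	if !w.limiter.Allow() {
		return 0, false
	}
	// Slide the window: drop the oldest sketch for each elapsed sub-interval.
	now := time.Now()
	for now.Sub(w.lastAdvance) >= w.subInterval {
		w.sketches = append(w.sketches[1:], newCountMinSketch(w.depth, w.width))
		w.lastAdvance = w.lastAdvance.Add(w.subInterval)
	}
	// Count the new occurrence in the latest sub-interval, then sum the
	// estimates across all sub-intervals.
	w.sketches[len(w.sketches)-1].Insert(key)
	for _, s := range w.sketches {
		count += s.Estimate(key)
	}
	return count, true
}
```

In the policy's evaluation path, a trace could then be sampled when Observe returns a count at or below a configured frequency threshold; the exact decision logic in the PR may differ.
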
Testing

Almost all the new code is covered by unit tests.

Documentation

A brief description of the new policy, along with an example configuration, has been added to the module's README.
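
For readers without the README at hand, here is a rough idea of the knobs such a policy would expose, sketched as a Go config struct. Every field name below is hypothetical; the real option names are whatever the PR's README documents.

```go
package sampling

import "time"

// LowFrequencySpansCfg is a hypothetical shape for the policy's settings;
// field names are illustrative and may differ from the PR's README.
type LowFrequencySpansCfg struct {
	// Total sliding-window length (e.g. 1h, 30m, 1m).
	Window time.Duration `mapstructure:"window"`
	// Number of equal sub-intervals, one CMS instance each.
	SubIntervals int `mapstructure:"sub_intervals"`
	// CMS dimensions: trade counting accuracy for memory.
	SketchDepth int `mapstructure:"sketch_depth"`
	SketchWidth int `mapstructure:"sketch_width"`
	// Spans seen at most this many times in the window count as low-frequency.
	FrequencyThreshold uint32 `mapstructure:"frequency_threshold"`
	// Per-second caps that smooth spikes at sub-interval boundaries.
	ProcessedSpansPerSecond int `mapstructure:"processed_spans_per_second"`
	SampledSpansPerSecond   int `mapstructure:"sampled_spans_per_second"`
}
```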

@larsn777 larsn777 requested review from jpkrohling and a team as code owners November 21, 2024 15:21
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Nov 21, 2024
Labels
processor/tailsampling Tail sampling processor