[processor/tailsampling] add low-frequency spans sampling policy #36487

Open
wants to merge 3 commits into main

Conversation

larsn777
Contributor

Description

Hello,
We occasionally encounter the need to sample low-frequency spans. However, as far as I can tell, this type of sampling has not yet been implemented in the tail sampling processor.
This PR adds a new sampling policy that lets us sample low-frequency spans with a configurable level of accuracy.

Main approaches and implementation details:

  1. Span Identifier. To measure how unique or frequent a given span is within a set, we need to choose a span identifier. I chose the pair of service name + span name. In the future, the fields that make up this identifier could be made configurable within the sampling policy.
  2. Count-Min Sketch (CMS). To count occurrences of unique spans in the data stream, we need an appropriate data structure. A naive solution would be a hash table keyed by the span identifier with an occurrence counter as the value; however, with high identifier cardinality such a table could grow too large. Instead I chose the Count-Min Sketch, which generally needs far less memory for counting span identifiers. Users can tune the policy's configuration parameters to trade counting accuracy against resource usage. (A minimal sketch of the identifier and CMS is shown right after this list.)
  3. Sliding CMS Window. Counting low-frequency spans should happen within a limited time frame (e.g., an hour, half an hour, or a minute); otherwise we would consume significant resources and the CMS would lose accuracy over time. A simple approach would be to create a new CMS at the start of each interval, populate it with data, and start over with a fresh CMS at the next interval. However, resetting the counters at every interval boundary would cause a surge of sampled spans at the beginning of each interval. To avoid this, we use a sliding CMS window: the entire interval (window) is divided into N equal sub-intervals, each with its own CMS instance. As the window slides forward by one sub-interval, the CMS for the oldest sub-interval is dropped and a new one is created. Each new occurrence is added to the CMS of the latest sub-interval, and the total count for a span is the sum of its counts across all sub-intervals. This approach significantly smooths out the number of spans sampled at interval boundaries. (See the second sketch after this list.)
  4. Limits on Processing and Sampling. As mentioned in the previous point, the sliding window smooths out spikes in sampled spans at interval boundaries but does not eliminate them entirely, since a large share of unique spans will likely fall into the first sub-interval, causing smaller spikes at sub-interval boundaries. To cap these spikes, we can limit the number of spans processed and sampled per second.
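
To make points 1 and 2 concrete, here is a minimal Go sketch of the span identifier and a Count-Min Sketch counter. This is not the PR's actual code: the package name, type names, and the FNV-based hashing are assumptions for illustration only.

```go
package sampling // illustrative package name, not necessarily the PR's

import "hash/fnv"

// spanKey builds the identifier used for frequency counting:
// the pair of service name and span name (point 1).
func spanKey(serviceName, spanName string) string {
	return serviceName + "/" + spanName
}

// countMinSketch is a depth x width matrix of counters; each row uses an
// independently seeded hash, and the estimate for a key is the minimum
// counter across rows (an upper bound on the true count).
type countMinSketch struct {
	depth, width uint32
	counters     [][]uint32
}

func newCountMinSketch(depth, width uint32) *countMinSketch {
	c := &countMinSketch{depth: depth, width: width, counters: make([][]uint32, depth)}
	for i := range c.counters {
		c.counters[i] = make([]uint32, width)
	}
	return c
}

// bucket derives a per-row index by prefixing the key with the row number
// before hashing it with FNV-1a.
func (c *countMinSketch) bucket(row uint32, key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte{byte(row)})
	h.Write([]byte(key))
	return h.Sum32() % c.width
}

// Insert increments the counters for key in every row.
func (c *countMinSketch) Insert(key string) {
	for row := uint32(0); row < c.depth; row++ {
		c.counters[row][c.bucket(row, key)]++
	}
}

// Estimate returns the minimum counter for key across all rows.
func (c *countMinSketch) Estimate(key string) uint32 {
	min := ^uint32(0)
	for row := uint32(0); row < c.depth; row++ {
		if v := c.counters[row][c.bucket(row, key)]; v < min {
			min = v
		}
	}
	return min
}
```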

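And a sketch of the sliding CMS window with a per-second processing cap (points 3 and 4), building on the countMinSketch above. Again, this is an assumption-laden illustration rather than the PR's implementation; the token-bucket limiter from golang.org/x/time/rate is just one way to express the per-second limit.

```go
package sampling

import (
	"time"

	"golang.org/x/time/rate"
)

// slidingCMSWindow keeps one countMinSketch per sub-interval. When the
// window advances, the oldest sketch is dropped and a fresh one appended,
// so counts decay gradually instead of resetting all at once.
type slidingCMSWindow struct {
	sketches     []*countMinSketch
	subInterval  time.Duration
	lastAdvance  time.Time
	depth, width uint32
	limiter      *rate.Limiter // caps spans processed per second (point 4)
}

func newSlidingCMSWindow(window, subInterval time.Duration, depth, width uint32, spansPerSecond int) *slidingCMSWindow {
	n := int(window / subInterval)
	w := &slidingCMSWindow{
		subInterval: subInterval,
		lastAdvance: time.Now(),
		depth:       depth,
		width:       width,
		limiter:     rate.NewLimiter(rate.Limit(spansPerSecond), spansPerSecond),
	}
	for i := 0; i < n; i++ {
		w.sketches = append(w.sketches, newCountMinSketch(depth, width))
	}
	return w
}

// Observe records one occurrence of key and returns its estimated count
// over the whole window; ok is false when the per-second limit is hit.
func (w *slidingCMSWindow) Observe(key string) (count uint32, ok bool) {
	if !w.limiter.Allow() {
		return 0, false
	}
	// Slide the window: drop the oldest sketch for each elapsed sub-interval.
	now := time.Now()
	for now.Sub(w.lastAdvance) >= w.subInterval {
		w.sketches = append(w.sketches[1:], newCountMinSketch(w.depth, w.width))
		w.lastAdvance = w.lastAdvance.Add(w.subInterval)
	}
	// Count the new occurrence in the latest sub-interval, then sum the
	// estimates across all sub-intervals.
	w.sketches[len(w.sketches)-1].Insert(key)
	for _, s := range w.sketches {
		count += s.Estimate(key)
	}
	return count, true
}
```

In the policy's evaluation path, a trace could then be sampled when Observe returns a count at or below a configured frequency threshold; the exact decision logic in the PR may differ.
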
Testing

Almost all the new code is covered by unit tests.

Documentation

A brief description of the new policy, along with an example configuration, has been added to the module's README.
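
For readers without the README at hand, here is a rough idea of the knobs such a policy would expose, sketched as a Go config struct. Every field name below is hypothetical; the real option names are whatever the PR's README documents.

```go
package sampling

import "time"

// LowFrequencySpansCfg is a hypothetical shape for the policy's settings;
// field names are illustrative and may differ from the PR's README.
type LowFrequencySpansCfg struct {
	// Total sliding-window length (e.g. 1h, 30m, 1m).
	Window time.Duration `mapstructure:"window"`
	// Number of equal sub-intervals, one CMS instance each.
	SubIntervals int `mapstructure:"sub_intervals"`
	// CMS dimensions: trade counting accuracy for memory.
	SketchDepth int `mapstructure:"sketch_depth"`
	SketchWidth int `mapstructure:"sketch_width"`
	// Spans seen at most this many times in the window count as low-frequency.
	FrequencyThreshold uint32 `mapstructure:"frequency_threshold"`
	// Per-second caps that smooth spikes at sub-interval boundaries.
	ProcessedSpansPerSecond int `mapstructure:"processed_spans_per_second"`
	SampledSpansPerSecond   int `mapstructure:"sampled_spans_per_second"`
}
```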

@larsn777 larsn777 requested review from jpkrohling and a team as code owners November 21, 2024 15:21
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Nov 21, 2024
Labels
processor/tailsampling Tail sampling processor