-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Goroutine leak detection with unit tests #4416
Add Goroutine leak detection with unit tests #4416
Conversation
f9cd1d6
to
e6478ad
Compare
f494457
to
cc28001
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is impressive. I liked the advanced thought in this. A few questions:
- Is it possible to change the constants for timing and thresholds to variables, with the defaults being constants, and they can be overridden via eve config? There may be legitimate scenarios where what the constants consider a problem is perfectly normal operation. We don't want to have to upgrade EVE on devices to change them when a config would do.
- Can we document this?
Yep, I want to do both. I should have done it before converting from Draft =) |
cc28001
to
d0bf014
Compare
065127d
to
04f15cd
Compare
Done! |
04f15cd
to
259c801
Compare
e7b6985
to
f6ccdc1
Compare
…in DumpAllStacks. Modify DumpAllStacks to create the debug directory if it doesn't already exist, ensuring that stack dumps can always be written to the specified location. Signed-off-by: Nikolay Martyanov <[email protected]>
The doc gives a short overview of the compoent and explains the manual garbage collection functionality. Signed-off-by: Nikolay Martyanov <[email protected]>
Introduce a detection system to identify potential goroutine leaks by combining a simple threshold check with trend-based monitoring. This approach periodically tracks the number of active goroutines, checking against a predefined maximum to catch immediate issues. Additionally, it applies a smoothing function to analyze longer-term trends within a sliding window. The detection logic includes: * A basic threshold check: If the goroutine count exceeds a set maximum, an alert is triggered. * A moving average to smooth fluctuations in goroutine counts over time. * Statistical analysis of the rate of change in goroutine counts, setting a dynamic threshold that adapts to typical variations. If the rate of increase consistently surpasses this threshold, it indicates a potential leak. Signed-off-by: Nikolay Martyanov <[email protected]>
f6ccdc1
to
a763555
Compare
9a7e141
to
9bf7913
Compare
Fixed races for the logger in the tests |
Introduce a suite of tests to verify the functionality of the goroutine leak detection system, simulating various scenarios to ensure robust leak identification. Growth scenarios: Test scenarios for initial goroutine growth with stabilization, sudden spikes, and decreases after spikes to validate that these typical patterns do not trigger false positives. Leak scenario: Include tests where the goroutine count gradually increases, simulating a leak. The test verifies that the detector correctly identifies this pattern as a leak. Monitoring tests: Implement tests for the `goroutinesMonitor` function to check detection behavior at regular intervals. Edge cases for the helper function: Add tests to handle empty data inputs, minimal data length, and cases where the moving average window exceeds data length, ensuring resilience against edge cases. Signed-off-by: Nikolay Martyanov <[email protected]>
Document the goroutine leak detection approach, including methods to monitor and identify abnormal increases in goroutines to support proactive system maintenance. Signed-off-by: Nikolay Martyanov <[email protected]>
Added new configuration options for the goroutine leak detector, enabling adjustment of parameters such as threshold, check interval, analysis window, stats retention, and cooldown period through global settings. This enhancement allows for dynamic tuning of leak detection based on deployment needs without code modification. Introduced `GoroutineLeakDetectionParams` struct to manage these settings, along with validation to ensure parameters meet minimum requirements. Updated the `watcher` component to use this configurable setup, and integrated new global config keys to hold these values. Signed-off-by: Nikolay Martyanov <[email protected]>
…ine leak detection. Extended tests to validate that the goroutine leak detection respects updates to configuration values at runtime. Specifically, the tests confirm that adjustments to the `keepStatsFor` parameter dynamically alter the stats slice size, with changes logged as expected. Signed-off-by: Nikolay Martyanov <[email protected]>
Enable the `goroutinesMonitor` to be stopped via a cancellable context within `GoroutineLeakDetectionParams`, allowing controlled termination of the monitoring goroutine. This change introduces a `MakeStoppable` method to set up a cancellable context and a `Stop` method to trigger the cancellation, allowing tests to end monitoring cleanly. Additionally, `checkStopCondition` was added to periodically check if the goroutine should stop. Updated tests to utilize this functionality, adding verification for proper start and stop messages in the log output, ensuring that the monitor operates correctly in both stoppable and unstoppable modes. Signed-off-by: Nikolay Martyanov <[email protected]>
9bf7913
to
9ee63bb
Compare
Suppressed output in the tests, where unnecessary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anything use the Stop() function?
Is the intent that setting some ConfigItems to a value of zero will cause a stop and setting that value back to non-zero will cause a Start()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-run tests
It's used in the tests only, not to run the monitor several times in one package. See #4416 (comment) |
Add a step to copy the goroutine leak detector stack dump (sigusr1) from /persist/agentdebug/watcher/ to the collected output directory as `goroutin-leak-detector-stacks-dump`, if present. Signed-off-by: Nikolay Martyanov <[email protected]>
Added the stacks dump file into collect-info if exists. |
This pull request introduces a goroutine leak detection system in the watcher component, along with a comprehensive test suite to validate the accuracy and reliability of leak identification. Additionally, a minor update ensures stack dumps work seamlessly within
watcher
by verifying that the required directory is created as needed.Key Changes:
Goroutine Leak Detection in watcher
The new detection system monitors goroutine counts to identify potential leaks by combining a threshold check for immediate spikes with trend-based analysis. This approach smooths data over time and uses dynamically adjusted thresholds to flag gradual increases, potentially indicating leaks.
Testing for Leak Scenarios
A suite of tests simulates various scenarios including growth patterns, spikes, gradual leaks, and edge cases to ensure the detector accurately distinguishes between normal variations and true leaks.
TODO: