Use of ** wildcard in log paths causes significant/excessive CPU usage #1443

Open · gilesp-pageup opened this issue Nov 26, 2024 · 0 comments

Describe the bug
This bug occurs when using the default configuration of CloudWatch Application Insights for a SQL Server workload on a Windows EC2 instance. When configured with the default log path, the CloudWatch agent consumes the full CPU resources of a single core (e.g. on a four-core xlarge instance, the agent consumes 25% of available CPU).

Default log path: C:\Program Files\Microsoft SQL Server\MSSQL**.MSSQLSERVER\MSSQL\Log\ERRORLOG

Steps to reproduce
(Note, this was tested in ap-southeast-2)

  • Launch a base Windows 2022 with SQL Server 2022 instance (e.g. ami-0e9ec1af20499fd81).

  • Add the instance to a resource group, then configure CloudWatch Application Insights as follows:

  • Navigate to CloudWatch - Application Insights in the AWS console and select 'Add an application'.

  • Specify the previously configured resource group containing the test instance, untick irrelevant options (e.g. Monitor EventBridge events), then click Next.

  • Remove all but SQL Server from the detected components and continue.

  • Leave the default settings for SQL Server, including the default log path above (C:\Program Files\Microsoft SQL Server\MSSQL**.MSSQLSERVER\MSSQL\Log\ERRORLOG).

  • Review the settings, then submit to create the Application Insights resources.

  • Monitor the test instance and note the increase in CPU utilization from amazon-cloudwatch-agent.exe.

  • Modify the Application Insights log path configuration to remove the wildcard (e.g. C:\Program Files\Microsoft SQL Server\MSSQL16.MSSQLSERVER\MSSQL\Log\ERRORLOG), either by editing the Application Insights setting, or by locating and editing the SSM Parameter that Application Insights has created.

  • Once the SSM Parameter has been updated, ensure the State Manager association has reapplied the new configuration to the test instance, and note the drop in CPU usage (amazon-cloudwatch-agent.exe stops and restarts with the new configuration when the association is applied).

Specifying more restrictive wildcards appears to mitigate the issue (e.g. using C:\Program Files\Microsoft SQL Server\MSSQL1[0-9].MSSQLSERVER\MSSQL\Log\ERRORLOG to cover versions from SQL Server 2008 onward, until an internal version 20 eventually appears).
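As an illustration of why the character-class form is cheaper, a bounded single-level pattern like MSSQL1[0-9].MSSQLSERVER can be matched against directory names with plain fnmatch-style logic and no recursive descent. A minimal Python sketch (the agent itself is written in Go; this only demonstrates the matching semantics, not the agent's implementation):

```python
import fnmatch

# Hypothetical candidate directory names under
# C:\Program Files\Microsoft SQL Server
dirs = ["MSSQL13.MSSQLSERVER", "MSSQL15.MSSQLSERVER",
        "MSSQL16.MSSQLSERVER", "MSSQLFAKE.MSSQLSERVER", "Client SDK"]

# The bounded pattern from the workaround: matches version-numbered
# instance directories 10-19 with a single pass over one directory level.
pattern = "MSSQL1[0-9].MSSQLSERVER"

matches = [d for d in dirs if fnmatch.fnmatch(d, pattern)]
print(matches)  # the three versioned instance directories
```

Evaluating such a pattern costs one scan of a single directory, regardless of how deep or wide the rest of the tree is, which is consistent with the CPU dropping back to normal once ** is removed.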

What did you expect to see?
Sensible CPU utilization from the CloudWatch agent when specifying log paths that include ** wildcards, particularly when the ** wildcard still only matches a single directory within C:\Program Files\Microsoft SQL Server.

What did you see instead?
High CPU utilization from the agent, consuming the entire resources of a core on the affected instance.

What version did you use?
Version: 1.300049.1b929
Note: This particular issue has been present since at least December 2019, prior to the agent source being available to the public.

What config did you use?
Config:

{
  "logs": {
    "logs_collected": {
      "windows_events": {
        "collect_list": [
          {"event_name": "Application", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-Application-testtest", "event_format": "xml"},
          {"event_name": "System", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-System-testtest", "event_format": "xml"},
          {"event_name": "Security", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-Security-testtest", "event_format": "xml"},
          {"event_name": "Application", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-Application-testtest", "event_format": "xml"},
          {"event_name": "System", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-System-testtest", "event_format": "xml"},
          {"event_name": "Security", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-Security-testtest", "event_format": "xml"}
        ]
      },
      "files": {
        "collect_list": [
          {"file_path": "C:\\Program Files\\Microsoft SQL Server\\MSSQL**.MSSQLSERVER\\MSSQL\\Log\\ERRORLOG", "log_group_name": "SQL_SERVER-testtest", "encoding": "utf-16"}
        ]
      }
    }
  },
  "metrics": {
    "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
    "metrics_collected": {
      "Processor": {"measurement": [{"name": "% Processor Time", "unit": "Percent"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "SQLServer:Buffer Manager": {"measurement": [{"name": "Buffer cache hit ratio", "unit": "Percent"}, {"name": "Page life expectancy", "unit": "Seconds"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "LogicalDisk": {"measurement": [{"name": "% Free Space", "unit": "Percent"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "Paging File": {"measurement": [{"name": "% Usage", "unit": "Percent"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "Memory": {"measurement": [{"name": "% Committed Bytes In Use", "unit": "Percent"}, {"name": "Available Mbytes", "unit": "Megabytes"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "PhysicalDisk": {"measurement": [{"name": "% Disk Time", "unit": "Percent"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "SQLServer:General Statistics": {"measurement": [{"name": "Processes blocked", "unit": "Count"}, {"name": "User Connections", "unit": "Count"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "SQLServer:SQL Statistics": {"measurement": [{"name": "Batch Requests/sec", "unit": "Count"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "Network Interface": {"measurement": [{"name": "Bytes Total/sec", "unit": "Bytes"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "System": {"measurement": [{"name": "Processor Queue Length", "unit": "Count"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "SQLServer:Locks": {"measurement": [{"name": "Number of Deadlocks/sec", "unit": "Count"}], "metrics_collection_interval": 60, "resources": ["*"]}
    },
    "aggregation_dimensions": [["InstanceId"]]
  }
}

Environment
OS: Windows 2022 with SQL Server 2022
(Has previously affected Windows 2016 and SQL Server 2017, likely affects all versions of Windows)
No exotic environmental conditions are required to trigger this behaviour. It will occur on base AWS provided AMIs such as the example provided above.

Additional context
This is similar to, but not quite the same as, issue #807. The problem is not related to the number of log files being ingested; in this case it is related to the use of wildcard matching partway down the directory tree to support multiple potential versions of SQL Server. It does not matter whether additional matched paths exist: CPU utilization spikes whenever the path is specified with wildcards. The same behaviour can be observed by choosing a path which won't exist (or have subdirectories), such as changing MSSQL** to MSSQLFAKE** within the path.

The only way to mitigate this behaviour is to avoid using ** as the wildcard in the path. Replacing it with ?? or 1[0-9] to match a specific range of potential paths works around the issue for SQL Server specifically, but this does not address other workloads with different, less rigid directory naming conventions.
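One plausible explanation for the observed cost, consistent with the symptom that CPU usage is independent of whether any match exists, is that a ** pattern forces a recursive walk of the whole subtree every time the pattern is re-evaluated, whereas a bounded pattern scans only one directory level. The sketch below uses Python's glob (a stand-in for the agent's Go matching code, which is not shown here) to contrast the two on a small hypothetical tree:

```python
import glob
import os
import tempfile

# Build a small stand-in directory tree: one "instance" directory with a
# log file, plus many unrelated subdirectories that a recursive walk
# still has to visit.
root = tempfile.mkdtemp()
logdir = os.path.join(root, "MSSQL16.MSSQLSERVER", "MSSQL", "Log")
os.makedirs(logdir)
open(os.path.join(logdir, "ERRORLOG"), "w").close()
for i in range(50):
    os.makedirs(os.path.join(root, "unrelated%d" % i, "sub"))

# A recursive ** pattern must re-walk every directory under root on
# each evaluation (and the agent re-evaluates on a poll interval)...
recursive = glob.glob(os.path.join(root, "**", "ERRORLOG"), recursive=True)

# ...while a bounded single-level pattern only scans root itself.
bounded = glob.glob(os.path.join(
    root, "MSSQL1[0-9].MSSQLSERVER", "MSSQL", "Log", "ERRORLOG"))

print(recursive == bounded)  # True: both find the same single file
```

If the agent re-expands the ** pattern on every poll, the walk cost scales with the total size of the subtree rather than with the number of matching log files, which would explain why removing the wildcard (or using a fake prefix such as MSSQLFAKE**) changes CPU usage even though the set of ingested files is identical.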
