Use of ** wildcard in log paths causes significant/excessive CPU usage #1443

Open · gilesp-pageup opened this issue Nov 26, 2024 · 0 comments

Describe the bug
This bug occurs when using the default configuration of CloudWatch Application Insights for a SQL Server workload on a Windows EC2 instance. When configured with the default log path, the CloudWatch agent consumes the full CPU resources of a single core (e.g. on a four-core xlarge instance, the agent consumes 25% of available CPU).

Default log path: C:\Program Files\Microsoft SQL Server\MSSQL**.MSSQLSERVER\MSSQL\Log\ERRORLOG

Steps to reproduce
(Note, this was tested in ap-southeast-2)

  • Launch a base Windows 2022 with SQL Server 2022 instance (e.g. ami-0e9ec1af20499fd81).

  • Add the instance to a resource group, then configure CloudWatch Application Insights as follows:

  • Navigate to CloudWatch - Application Insights in the AWS console and select 'Add an application'.

  • Specify the previously configured resource group containing the test instance, untick irrelevant options (e.g. Monitor EventBridge events), then click Next.

  • Remove all but SQL Server from the detected components and continue.

  • Leave the default settings for SQL Server, including the default log path above (C:\Program Files\Microsoft SQL Server\MSSQL**.MSSQLSERVER\MSSQL\Log\ERRORLOG).

  • Review the settings, then submit to create the Application Insights resources.

  • Monitor the test instance and note the increase in CPU utilization from amazon-cloudwatch-agent.exe.

  • Modify the Application Insights log path configuration to remove the wildcard (e.g. C:\Program Files\Microsoft SQL Server\MSSQL16.MSSQLSERVER\MSSQL\Log\ERRORLOG), either by editing the Application Insights setting, or by locating and editing the SSM Parameter that Application Insights has created.

  • Once the SSM Parameter has been updated, ensure the State Manager association has reapplied the new configuration to the test instance, and note the drop in CPU usage (amazon-cloudwatch-agent.exe stops and restarts with the new configuration when the association is applied).

Specifying more restrictive wildcards appears to mitigate the issue (e.g. using C:\Program Files\Microsoft SQL Server\MSSQL1[0-9].MSSQLSERVER\MSSQL\Log\ERRORLOG to cover versions from SQL Server 2008 onward, until an internal version 20 eventually appears).
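As an illustration of why the character-class form is cheaper, a bounded single-level pattern like MSSQL1[0-9].MSSQLSERVER can be matched against directory names with plain fnmatch-style logic and no recursive descent. A minimal Python sketch (the agent itself is written in Go; this only demonstrates the matching semantics, not the agent's implementation):

```python
import fnmatch

# Hypothetical candidate directory names under
# C:\Program Files\Microsoft SQL Server
dirs = ["MSSQL13.MSSQLSERVER", "MSSQL15.MSSQLSERVER",
        "MSSQL16.MSSQLSERVER", "MSSQLFAKE.MSSQLSERVER", "Client SDK"]

# The bounded pattern from the workaround: matches version-numbered
# instance directories 10-19 with a single pass over one directory level.
pattern = "MSSQL1[0-9].MSSQLSERVER"

matches = [d for d in dirs if fnmatch.fnmatch(d, pattern)]
print(matches)  # the three versioned instance directories
```

Evaluating such a pattern costs one scan of a single directory, regardless of how deep or wide the rest of the tree is, which is consistent with the CPU dropping back to normal once ** is removed.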

What did you expect to see?
Sensible CPU utilization from the CloudWatch agent when specifying log paths that include ** wildcards, particularly when the ** wildcard still only matches a single directory within C:\Program Files\Microsoft SQL Server.

What did you see instead?
High CPU utilization from the agent, consuming the entire resources of a core on the affected instance.

What version did you use?
Version: 1.300049.1b929
Note: This particular issue has been present since at least December 2019, prior to the agent source being available to the public.

What config did you use?
Config:

{
  "logs": {
    "logs_collected": {
      "windows_events": {
        "collect_list": [
          {"event_name": "Application", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-Application-testtest", "event_format": "xml"},
          {"event_name": "System", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-System-testtest", "event_format": "xml"},
          {"event_name": "Security", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-Security-testtest", "event_format": "xml"},
          {"event_name": "Application", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-Application-testtest", "event_format": "xml"},
          {"event_name": "System", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-System-testtest", "event_format": "xml"},
          {"event_name": "Security", "event_levels": ["WARNING", "ERROR", "CRITICAL"], "log_group_name": "WINDOWS_EVENTS-Security-testtest", "event_format": "xml"}
        ]
      },
      "files": {
        "collect_list": [
          {"file_path": "C:\\Program Files\\Microsoft SQL Server\\MSSQL**.MSSQLSERVER\\MSSQL\\Log\\ERRORLOG", "log_group_name": "SQL_SERVER-testtest", "encoding": "utf-16"}
        ]
      }
    }
  },
  "metrics": {
    "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
    "metrics_collected": {
      "Processor": {"measurement": [{"name": "% Processor Time", "unit": "Percent"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "SQLServer:Buffer Manager": {"measurement": [{"name": "Buffer cache hit ratio", "unit": "Percent"}, {"name": "Page life expectancy", "unit": "Seconds"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "LogicalDisk": {"measurement": [{"name": "% Free Space", "unit": "Percent"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "Paging File": {"measurement": [{"name": "% Usage", "unit": "Percent"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "Memory": {"measurement": [{"name": "% Committed Bytes In Use", "unit": "Percent"}, {"name": "Available Mbytes", "unit": "Megabytes"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "PhysicalDisk": {"measurement": [{"name": "% Disk Time", "unit": "Percent"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "SQLServer:General Statistics": {"measurement": [{"name": "Processes blocked", "unit": "Count"}, {"name": "User Connections", "unit": "Count"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "SQLServer:SQL Statistics": {"measurement": [{"name": "Batch Requests/sec", "unit": "Count"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "Network Interface": {"measurement": [{"name": "Bytes Total/sec", "unit": "Bytes"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "System": {"measurement": [{"name": "Processor Queue Length", "unit": "Count"}], "metrics_collection_interval": 60, "resources": ["*"]},
      "SQLServer:Locks": {"measurement": [{"name": "Number of Deadlocks/sec", "unit": "Count"}], "metrics_collection_interval": 60, "resources": ["*"]}
    },
    "aggregation_dimensions": [["InstanceId"]]
  }
}

Environment
OS: Windows 2022 with SQL Server 2022
(Has previously affected Windows 2016 and SQL Server 2017, likely affects all versions of Windows)
No exotic environmental conditions are required to trigger this behaviour. It will occur on base AWS provided AMIs such as the example provided above.

Additional context
This is similar to, but not quite the same as, issue #807. The problem is not related to the number of log files being ingested; in this case it is related to the use of wildcard matching partway down the directory tree to support multiple potential versions of SQL Server. It does not matter whether additional matched paths exist: CPU utilization spikes whenever the path is specified with wildcards. The same behaviour can be observed by choosing a path which won't exist (or have subdirectories), such as changing MSSQL** to MSSQLFAKE** within the path.

The only way to mitigate this behaviour is to avoid using ** as the wildcard in the path. Replacing it with ?? or 1[0-9] to match a specific range of potential paths works around the issue for SQL Server specifically, but this does not address other workloads with different, less rigid directory naming conventions.
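One plausible explanation for the observed cost, consistent with the symptom that CPU usage is independent of whether any match exists, is that a ** pattern forces a recursive walk of the whole subtree every time the pattern is re-evaluated, whereas a bounded pattern scans only one directory level. The sketch below uses Python's glob (a stand-in for the agent's Go matching code, which is not shown here) to contrast the two on a small hypothetical tree:

```python
import glob
import os
import tempfile

# Build a small stand-in directory tree: one "instance" directory with a
# log file, plus many unrelated subdirectories that a recursive walk
# still has to visit.
root = tempfile.mkdtemp()
logdir = os.path.join(root, "MSSQL16.MSSQLSERVER", "MSSQL", "Log")
os.makedirs(logdir)
open(os.path.join(logdir, "ERRORLOG"), "w").close()
for i in range(50):
    os.makedirs(os.path.join(root, "unrelated%d" % i, "sub"))

# A recursive ** pattern must re-walk every directory under root on
# each evaluation (and the agent re-evaluates on a poll interval)...
recursive = glob.glob(os.path.join(root, "**", "ERRORLOG"), recursive=True)

# ...while a bounded single-level pattern only scans root itself.
bounded = glob.glob(os.path.join(
    root, "MSSQL1[0-9].MSSQLSERVER", "MSSQL", "Log", "ERRORLOG"))

print(recursive == bounded)  # True: both find the same single file
```

If the agent re-expands the ** pattern on every poll, the walk cost scales with the total size of the subtree rather than with the number of matching log files, which would explain why removing the wildcard (or using a fake prefix such as MSSQLFAKE**) changes CPU usage even though the set of ingested files is identical.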
