Warning Message when insufficient space to hold multiple persistent queues on a file system could be clearer #14839

robbavey · 2023-01-11T23:21:49Z

Logstash version:

Logstash 7.x >= 7.17.5
Logstash 8.x >= 8.3.0

Steps to reproduce:

1. Configure multiple pipelines whose combined persistent queue size (queue.max_bytes) is greater than the free space remaining on the disk.
2. Start Logstash.

When starting up, Logstash checks the total space required for PQs on each file system against the free space left on that file system, and logs a warning when the required space exceeds it.

However, the warning message is difficult to follow and does not point to the correct remediation:

I set up a config on my laptop, which has 312Gi free on its local drive, and configured two pipelines, each with

queue.max_bytes: 300gb

configured, and then started Logstash.

I received the following warning message:

[2023-01-11T17:43:50,148][WARN ][logstash.persistedqueueconfigvalidator] The persistent queue on path "/Users/robbavey/logstash-8.5.0/data/queue/test" won't fit in file system "/dev/disk1s2" when full. Please free or allocate 643171352576 more bytes. The persistent queue on path "/Users/robbavey/logstash-8.5.0/data/queue/test2" won't fit in file system "/dev/disk1s2" when full. Please free or allocate 643171352576 more bytes.

This number ("Please free or allocate 643171352576 more bytes.") is confusing, because I actually need fewer bytes than that for the PQs to operate.

The number appears to be:

(total size required across all PQs) - (disk already used across all PQs)

However, the disk may not be dedicated to PQ storage, so this number can be misleading.
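To sanity-check that reading of the reported number, here is the arithmetic as a short Ruby script. It assumes Logstash byte units are binary (1gb = 2**30 bytes) and takes the "312Gi free" from the report as exactly 312 GiB; both are assumptions rather than values read from Logstash.

```ruby
# Assumption: Logstash byte units are binary, so 1gb == 2**30 bytes.
GIB = 1 << 30

max_bytes_per_queue = 300 * GIB            # queue.max_bytes: 300gb, per pipeline
total_max_bytes     = 2 * max_bytes_per_queue
reported_bytes      = 643_171_352_576      # number printed in the warning

# If reported == (total max_bytes) - (bytes already used by PQ pages),
# then the two queues already held this much on disk:
used_bytes = total_max_bytes - reported_bytes

# Assumption: the "312Gi free" from the report is exactly 312 GiB.
free_bytes = 312 * GIB
# Pages already on disk are not counted in free_bytes, so the real
# shortfall is the remaining headroom minus what is free.
shortfall = reported_bytes - free_bytes

puts "already used by PQ: #{used_bytes / GIB} GiB"      # 1 GiB
puts "reported as needed: #{reported_bytes / GIB} GiB"  # 599 GiB
puts "actual shortfall:   #{shortfall / GIB} GiB"       # 287 GiB
```

Under those assumptions, the two queues already held about 1 GiB, the warning asks for roughly 599 GiB, and only about 287 GiB more is actually needed, which is the gap described above.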

It may be more useful to report:

- the total number of bytes that Logstash requires for PQ storage on the specific file system
- the number of bytes actually free on that file system
- the number of bytes already used by PQs on that file system
- the requirements of each PQ on that file system
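As a rough sketch of how those values relate, the Ruby below computes them for one file system from per-queue data. The method name filesystem_summary and the hash keys (:path, :max_bytes, :used_bytes) are hypothetical, not existing Logstash APIs, and free_bytes is assumed to come from whatever the validator already uses to measure free space.

```ruby
GIB = 1 << 30 # assumption: binary units

# Hypothetical helper: summarize PQ requirements for a single file system.
# queues:     array of hashes with :path, :max_bytes and :used_bytes (bytes)
# free_bytes: free space currently reported for that file system (bytes)
def filesystem_summary(queues, free_bytes)
  required = queues.sum { |q| q[:max_bytes] }  # total max_bytes on this file system
  used     = queues.sum { |q| q[:used_bytes] } # bytes already held by PQ pages
  # Pages already on disk are not part of free_bytes, so only the remaining
  # headroom (required - used) still has to fit into the free space.
  needed = [required - used - free_bytes, 0].max
  {
    required: required,
    free: free_bytes,
    used: used,
    needed: needed,
    queues: queues.map { |q| q.slice(:path, :used_bytes, :max_bytes) }
  }
end

summary = filesystem_summary(
  [{ path: "/data/queue/a", max_bytes: 300 * GIB, used_bytes: 30 * GIB },
   { path: "/data/queue/b", max_bytes: 300 * GIB, used_bytes: 20 * GIB }],
  312 * GIB
)
p summary[:needed] / GIB # => 238
```

Clamping needed at zero keeps the report from showing a negative requirement when everything fits.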

It may also be worth strengthening the warning to state that Logstash may fail to start if this is not resolved.


donoghuc · 2024-11-07T16:05:43Z

As I'm still learning these concepts, can I extend your example to check my understanding of this proposed improvement, so you can review it before I move forward with an implementation?

In your example we configure two pipelines with persistent queues. The relevant config settings are:
PQ1

queue.max_bytes: 300gb
queue.path: /Users/robbavey/logstash-8.5.0/data/queue/test1

PQ2

queue.max_bytes: 300gb
queue.path: /Users/robbavey/logstash-8.5.0/data/queue/test2

The LogStash::PersistedQueueConfigValidator#check method computes the resource requirements relevant to this warning. From logstash/logstash-core/lib/logstash/persisted_queue_config_validator.rb, lines 36 to 66 in 046ea1f:

```ruby
def check(running_pipelines, pipeline_configs)
  # Compare value of new pipeline config and pipeline registry and cache
  has_update = queue_configs_update?(running_pipelines, pipeline_configs) && cache_check_fail?(pipeline_configs)
  @last_check_pipeline_configs = pipeline_configs
  return unless has_update
  warn_msg = []
  err_msg = []
  queue_path_file_system = Hash.new # (String: queue path, String: file system)
  required_free_bytes  = Hash.new # (String: file system, Integer: size)
  pipeline_configs.select { |config| config.settings.get('queue.type') == 'persisted'}
                  .select { |config| config.settings.get('queue.max_bytes').to_i != 0 }
                  .each do |config|
    max_bytes = config.settings.get("queue.max_bytes").to_i
    page_capacity = config.settings.get("queue.page_capacity").to_i
    pipeline_id = config.settings.get("pipeline.id")
    queue_path = ::File.join(config.settings.get("path.queue"), pipeline_id)
    pq_page_glob = ::File.join(queue_path, "page.*")
    create_dirs(queue_path)
    used_bytes = get_page_size(pq_page_glob)
    file_system = get_file_system(queue_path)
    check_page_capacity(err_msg, pipeline_id, max_bytes, page_capacity)
    check_queue_usage(warn_msg, pipeline_id, max_bytes, used_bytes)
    queue_path_file_system[queue_path] = file_system
    if used_bytes < max_bytes
      required_free_bytes[file_system] = required_free_bytes.fetch(file_system, 0) + max_bytes - used_bytes
    end
  end
```

Currently, for each of these configs we read max_bytes from the config and use the path to determine which filesystem the queue will occupy. In our case, both /Users/robbavey/logstash-8.5.0/data/queue/test1 and /Users/robbavey/logstash-8.5.0/data/queue/test2 are on /dev/disk1s2. For each queue whose current usage is still below its max_bytes, the remaining headroom (max_bytes - used_bytes) is added to the total for that filesystem. From logstash/logstash-core/lib/logstash/persisted_queue_config_validator.rb, lines 63 to 65 in 046ea1f:

```ruby
if used_bytes < max_bytes
  required_free_bytes[file_system] = required_free_bytes.fetch(file_system, 0) + max_bytes - used_bytes
end
```

There are several shortcomings with this approach:

- The required bytes for a filesystem are the sum of each queue's remaining headroom (max_bytes - used_bytes), and that full total is reported as the amount to "free or allocate" rather than as the shortfall against the space that is actually free.
- We assume that the filesystem is dedicated to PQ storage only, which is probably not a safe assumption.
- It is not clear how a consumer of the warning should act on it.

Proposed improvement:
Here is an example of the proposed warning:

The `max_bytes` allocated for persistent queues on filesystem '/dev/disk1s2' exceed available space.
'/dev/disk1s2' filesystem status:
- Total space required: 600gb
- Current free space: 312gb
- Current PQ usage: 50gb
- Additional space needed: 238gb

Individual queue requirements on '/dev/disk1s2' filesystem:
  /Users/robbavey/logstash-8.5.0/data/queue/test1:
    Current size: 30gb
    Maximum size: 300gb
  /Users/robbavey/logstash-8.5.0/data/queue/test2:
    Current size: 20gb
    Maximum size: 300gb

Please either:
1. Free up disk space
2. Reduce queue.max_bytes in your pipeline configurations
3. Move PQ storage to a filesystem with more available space
Note: Logstash may fail to start if this is not resolved.
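A minimal sketch of rendering a warning in that shape, assuming a per-filesystem summary hash with hypothetical keys (:required, :free, :used, :needed, :queues) has already been computed, with needed taken as required - used - free. Sizes are shown in whole GiB via integer division only to keep the example short; gb and render_pq_warning are illustrative names, not Logstash methods.

```ruby
GIB = 1 << 30 # assumption: binary units

# Hypothetical formatter: bytes -> whole GiB (truncating), for illustration only.
def gb(bytes)
  "#{bytes / GIB}gb"
end

# Builds a multi-line warning for one filesystem from a precomputed summary.
def render_pq_warning(file_system, summary)
  lines = []
  lines << "The `max_bytes` allocated for persistent queues on filesystem '#{file_system}' exceed available space."
  lines << "'#{file_system}' filesystem status:"
  lines << "- Total space required: #{gb(summary[:required])}"
  lines << "- Current free space: #{gb(summary[:free])}"
  lines << "- Current PQ usage: #{gb(summary[:used])}"
  lines << "- Additional space needed: #{gb(summary[:needed])}"
  lines << ""
  lines << "Individual queue requirements on '#{file_system}' filesystem:"
  summary[:queues].each do |q|
    lines << "  #{q[:path]}:"
    lines << "    Current size: #{gb(q[:used_bytes])}"
    lines << "    Maximum size: #{gb(q[:max_bytes])}"
  end
  lines << ""
  lines << "Please either:"
  lines << "1. Free up disk space"
  lines << "2. Reduce queue.max_bytes in your pipeline configurations"
  lines << "3. Move PQ storage to a filesystem with more available space"
  lines << "Note: Logstash may fail to start if this is not resolved."
  lines.join("\n")
end

summary = {
  required: 600 * GIB, free: 312 * GIB, used: 50 * GIB, needed: 238 * GIB,
  queues: [
    { path: "/data/queue/test1", used_bytes: 30 * GIB, max_bytes: 300 * GIB },
    { path: "/data/queue/test2", used_bytes: 20 * GIB, max_bytes: 300 * GIB }
  ]
}
puts render_pq_warning("/dev/disk1s2", summary)
```

Emitting one message per filesystem, rather than one per queue path, avoids repeating the same total for every queue that shares a disk.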

This would involve a pass over all the configs to sum the required max_bytes and collect all the queue paths. With that information, we can partition by filesystem and build one warning per filesystem (our example has a single filesystem, but there could be several if queue paths are configured on multiple filesystems).
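The partitioning step could look roughly like this: collect one entry per persisted queue, then group the entries by filesystem so that each group can get its own warning. The entry hash shape and the name group_queues_by_file_system are hypothetical; in the validator, the filesystem would presumably come from get_file_system(queue_path) and the sizes from the pipeline settings.

```ruby
# Hypothetical: partition per-queue entries by the filesystem they live on.
# Each entry is a hash with :path, :file_system, :max_bytes and :used_bytes.
def group_queues_by_file_system(entries)
  entries.group_by { |e| e[:file_system] }
         .transform_values do |queues|
           {
             required: queues.sum { |q| q[:max_bytes] },
             used: queues.sum { |q| q[:used_bytes] },
             paths: queues.map { |q| q[:path] }
           }
         end
end

entries = [
  { path: "/data/queue/a", file_system: "/dev/disk1s2", max_bytes: 300, used_bytes: 30 },
  { path: "/data/queue/b", file_system: "/dev/disk1s2", max_bytes: 300, used_bytes: 20 },
  { path: "/mnt/big/queue/c", file_system: "/dev/sdb1", max_bytes: 100, used_bytes: 0 }
]
p group_queues_by_file_system(entries)
```

Free space would then be looked up once per key of the resulting hash instead of once per queue.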

Implementation notes: from reviewing the existing methods in that class, I think we could compute the space available on a filesystem as well as the space used under a given queue path. This should allow us to distinguish between what Logstash explicitly uses for queue storage and what other entities on the system may be using. I also see this util module, which I think would be nicer for formatting human-readable byte sizes (logstash/logstash-core/lib/logstash/util/byte_value.rb, lines 57 to 73 in 046ea1f):

```ruby
def human_readable(number)
  value, unit = if number > PB
    [number / PB, "pb"]
  elsif number > TB
    [number / TB, "tb"]
  elsif number > GB
    [number / GB, "gb"]
  elsif number > MB
    [number / MB, "mb"]
  elsif number > KB
    [number / KB, "kb"]
  else
    [number, "b"]
  end
  format("%.2d%s", value, unit)
end
```
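One caveat if human_readable is reused as quoted: assuming integer input and binary unit constants, it divides integers and formats with %.2d, so fractional sizes are truncated and zero-padded (about 1.5 GiB prints as "01gb"), and because it compares with > rather than >=, exactly 1 GiB prints as "1024mb". A hypothetical float-based variant, with its own copies of the unit constants, might look like this:

```ruby
# Binary units, assumed to match the constants used by byte_value.rb.
KB = 1 << 10
MB = KB << 10
GB = MB << 10
TB = GB << 10
PB = TB << 10

UNITS = [[PB, "pb"], [TB, "tb"], [GB, "gb"], [MB, "mb"], [KB, "kb"]].freeze

# Hypothetical variant of human_readable: uses >= and float division,
# printing two decimal places for any unit above plain bytes.
def human_readable_float(number)
  divisor, unit = UNITS.find { |size, _| number >= size }
  return "#{number}b" if divisor.nil?
  format("%.2f%s", number.to_f / divisor, unit)
end

puts human_readable_float(1610612736)    # 1.50gb
puts human_readable_float(1 << 30)       # 1.00gb
puts human_readable_float(643171352576)  # 599.00gb
puts human_readable_float(512)           # 512b
```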

This commit refactors the `PersistedQueueConfigValidator` class to provide a more detailed, accurate and actionable warning when pipelines' PQ configs are at risk of running out of disk space. See elastic#14839 for design considerations. The highlights of the changes include accurately determining the free resources on a filesystem disk and then providing a breakdown of the usage for each of the paths configured for a queue.
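For the per-path usage breakdown, the bytes currently used by a queue can be approximated by summing the sizes of its page files, using the same "page.*" glob that the validator builds as pq_page_glob. The helper name queue_used_bytes is hypothetical and this sketch does not claim to match get_page_size exactly; measuring free space on the filesystem itself is out of scope here.

```ruby
require "tmpdir"

# Hypothetical helper: total size of the PQ page files under one queue path,
# using the same "page.*" pattern as the validator's pq_page_glob.
def queue_used_bytes(queue_path)
  Dir.glob(File.join(queue_path, "page.*")).sum { |page| File.size(page) }
end

# Usage: a throwaway queue directory with two page files and one non-page file.
Dir.mktmpdir do |queue_path|
  File.write(File.join(queue_path, "page.1"), "a" * 100)
  File.write(File.join(queue_path, "page.2"), "b" * 50)
  File.write(File.join(queue_path, "checkpoint.head"), "x" * 10)
  puts queue_used_bytes(queue_path) # 150
end
```

Only files matching page.* are counted, so checkpoints and other files in the queue directory do not inflate the reported PQ usage.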

robbavey added bug status:needs-triage int-shortlist labels Jan 11, 2023

roaksoax added Team:Logstash and removed status:needs-triage labels Feb 21, 2023

donoghuc self-assigned this Nov 7, 2024

donoghuc linked a pull request Nov 7, 2024 that will close this issue

Improve warning for insufficient file resources for PQ max_bytes #16656 (open)
