
[WIP] Ensure workers and miq_worker rows match #22771

Open
wants to merge 1 commit into master from ensure_workers_and_worker_rows_match

Conversation

@agrare (Member) commented Nov 2, 2023

It is possible for the running workers and their miq_worker rows to get out of sync.

#22644

@agrare agrare added the bug label Nov 2, 2023
@agrare agrare requested a review from jrafanie as a code owner November 2, 2023 15:25
@agrare agrare added the wip label Nov 2, 2023
@agrare agrare force-pushed the ensure_workers_and_worker_rows_match branch from 4f32ebc to f368402 Compare November 2, 2023 15:26
Comment on lines 15 to 16
# Update worker deployments with updated settings such as cpu/memory limits
sync_deployment_settings
@agrare (Member, Author)

TODO: I think this has to run after cleanup_orphaned_worker_rows. It also doesn't make sense, IMO, to have sync_deployment_settings in sync_from_system, since that method is supposed to just "find what is out there".
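The ordering being proposed in the TODO can be sketched as follows. This is a minimal, hypothetical illustration, not the actual ManageIQ worker-management code: the monitor_workers name and the plain-Ruby class are assumptions, and the recorded call list only stands in for the real side effects.

```ruby
# Hypothetical sketch of the proposed monitor-loop ordering: pure discovery
# first, row cleanup second, settings push last, so deployment settings are
# only applied once orphaned miq_worker rows have been reconciled.
class WorkerManagerSketch
  attr_reader :calls

  def initialize
    @calls = []
  end

  def monitor_workers
    sync_from_system              # pure discovery: "find what is out there"
    cleanup_orphaned_worker_rows  # reconcile miq_worker rows with reality
    sync_deployment_settings      # only then push cpu/memory limit updates
  end

  def sync_from_system
    @calls << :sync_from_system
  end

  def cleanup_orphaned_worker_rows
    @calls << :cleanup_orphaned_worker_rows
  end

  def sync_deployment_settings
    @calls << :sync_deployment_settings
  end
end
```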

@agrare agrare force-pushed the ensure_workers_and_worker_rows_match branch from f368402 to 0d79355 Compare November 2, 2023 15:27
@miq-bot miq-bot removed the wip label Nov 2, 2023
@agrare agrare force-pushed the ensure_workers_and_worker_rows_match branch 2 times, most recently from 2d55105 to e6e1c39 Compare November 2, 2023 15:40
@@ -50,6 +46,9 @@ def cleanup_orphaned_worker_rows
end
end

def cleanup_orphaned_workers
end
Member:

Can you add a comment in here just to describe why this is empty?

Member:

Same comment for the other "empty" methods.

Member (Author):

It just isn't implemented yet; I'm testing with the Process model first.

end

def sync_starting_workers
MiqWorker.find_all_starting.to_a
end

def cleanup_orphaned_worker_rows
orphaned_rows = miq_workers.where.not(:pid => miq_pids)
Member:

We use the system_uid column in kubernetes. Is that column populated with the pid in process mode? If so, it may be worth using the same column for consistency.

Member (Author):

It isn't set for process or systemd, but we should do that for consistency. That will be another PR, but we can probably get it in first and make use of it here.
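The consistency idea from this exchange can be sketched with plain Ruby. This is illustrative only: WorkerRow and orphaned_rows are hypothetical stand-ins (a Struct instead of an ActiveRecord model), assuming system_uid were populated in all modes (pid for process, unit name for systemd, pod name for kubernetes) so one column drives the orphan check everywhere.

```ruby
# Hypothetical stand-in for a miq_workers row; only the columns the
# orphan check needs.
WorkerRow = Struct.new(:id, :system_uid)

# Rows whose system_uid does not match any live worker uid are orphans,
# regardless of whether the uid came from a pid, a systemd unit, or a pod.
def orphaned_rows(rows, live_uids)
  rows.reject { |row| live_uids.include?(row.system_uid) }
end
```

With system_uid populated uniformly, the process-mode `where.not(:pid => miq_pids)` query and the kubernetes cleanup could share this one shape.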

@kbrock (Member) commented Nov 2, 2023

The first 10 failures are from the awesome_spawn update.

I think this one is reasonable:

  11) MiqServer::WorkerManagement::Kubernetes#sync_from_system #ensure_kube_monitors_started podified, ensures pod monitor started and orphaned rows are removed
      Failure/Error: expect(server.worker_manager).to receive(:cleanup_orphaned_worker_rows)

        (#<MiqServer::WorkerManagement::Kubernetes:0x0000564498f5f478 @my_server=#<MiqServer id: 88000000002459, guid: "4ae48e49-aef7-4e8c-9c53-361dc1a662a5", status: "started", started_on: "2023-11-02 15:58:19.285129921 +0000", stopped_on: nil, pid: nil, build: nil, percent_memory: nil, percent_cpu: nil, cpu_time: nil, name: "miq_server_0000000002315", last_heartbeat: "2023-11-02 15:58:19.285110421 +0000", os_priority: nil, is_master: false, logo: nil, version: "9.9.9.9", zone_id: 88000000004083, memory_usage: nil, memory_size: nil, hostname: nil, ipaddress: nil, drb_uri: nil, mac_address: nil, vm_id: nil, has_active_userinterface: nil, has_active_webservices: nil, sql_spid: nil, log_file_depot_id: nil, proportional_set_size: nil, has_active_remote_console: nil, system_memory_free: nil, system_memory_used: nil, system_swap_free: nil, system_swap_used: nil, has_active_cockpit_ws: nil, unique_set_size: nil, has_vix_disk_lib: nil>, @workers_lock=#<Sync:0x0000564498f5f388 @sync_mode=:UN, @sync_waiting=[], @sync_upgrade_waiting=[], @sync_sh_locker={}, @sync_ex_locker=nil, @sync_ex_count=0, @sync_mutex=#<Thread::Mutex:0x0000564498f5f1d0>>, @workers={}, @queue_messages_lock=#<Sync:0x0000564498f5f068 @sync_mode=:UN, @sync_waiting=[], @sync_upgrade_waiting=[], @sync_sh_locker={}, @sync_ex_locker=nil, @sync_ex_count=0, @sync_mutex=#<Thread::Mutex:0x0000564498f5eed8>>, @queue_messages={}>).cleanup_orphaned_worker_rows(*(any args))
            expected: 1 time with any arguments
            received: 0 times with any arguments
      # ./spec/models/miq_server/worker_management/kubernetes_spec.rb:125:in `block (4 levels) in <top (required)>'

@agrare agrare changed the title Ensure workers and miq_worker rows match [WIP] Ensure workers and miq_worker rows match Nov 2, 2023
@agrare (Member, Author) commented Nov 2, 2023

@kbrock yeah, this isn't close to done yet; I expect the tests to fail right now.

@miq-bot miq-bot added the wip label Nov 2, 2023
@Fryguy Fryguy closed this Nov 2, 2023
@Fryguy Fryguy reopened this Nov 2, 2023
@Fryguy (Member) commented Nov 2, 2023

Reran after merge of #22772

@agrare agrare force-pushed the ensure_workers_and_worker_rows_match branch 3 times, most recently from b11a008 to 697d8b8 Compare November 2, 2023 20:03
@jrafanie (Member) commented Nov 3, 2023

@agrare I don't remember the details why, but I recall worker rows are created by the server in spawn/systemd, and in the pod itself when it invokes run_single_worker.rb. In other words, for a moment we have a row without a process in the spawn/systemd case, and a pod without a row in the podified case. I don't know if this change, or the change I tested earlier this week, would cause a problem with this existing assumption.

In my case, I was attempting to reload the worker row before each heartbeat and let the process exit if the row was removed.

diff --git a/app/models/miq_worker/runner.rb b/app/models/miq_worker/runner.rb
index 6b7ed69f3d..c8dd3ffd9c 100644
--- a/app/models/miq_worker/runner.rb
+++ b/app/models/miq_worker/runner.rb
@@ -319,6 +319,7 @@ class MiqWorker::Runner
     # Heartbeats can be expensive, so do them only when needed
     return if @last_hb.kind_of?(Time) && (@last_hb + worker_settings[:heartbeat_freq]) >= now

+    reload_worker_record
     systemd_worker? ? @worker.sd_notify_watchdog : heartbeat_to_file

     if config_out_of_date?
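The idea behind that one-line diff can be sketched in isolation. This is a hypothetical simplification, not MiqWorker::Runner itself: the hash stands in for the miq_workers table, and the exited? flag stands in for the process actually exiting.

```ruby
# Minimal sketch of "reload the worker row before each heartbeat and let
# the process exit if the row was removed". Real code would reload an
# ActiveRecord row and terminate the process; here a hash lookup and a
# flag simulate that behavior.
class RunnerSketch
  def initialize(store, worker_id)
    @store     = store      # stands in for the miq_workers table
    @worker_id = worker_id
    @exited    = false
  end

  def heartbeat
    reload_worker_record
    return if @exited
    # ...sd_notify_watchdog / heartbeat_to_file would go here...
  end

  def exited?
    @exited
  end

  private

  def reload_worker_record
    @exited = true unless @store.key?(@worker_id) # row gone: stop the worker
  end
end
```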

@agrare (Member, Author) commented Nov 6, 2023

@jrafanie that is a great point, and something we need to be careful about. Kubernetes creates the worker rows in run_single_worker because we can't know the GUID prior to starting the pod (all pods in a replica set have to have the same environment). Kubernetes was already deleting all worker rows that didn't match a running pod.

For the systemd/process model I believe we're okay based on where we are in the monitor loop: by the time we check for orphaned rows, the worker records will have been created and the process or systemd service will have been started.

For kubernetes there could be an issue in the reverse direction, where the pod is created but extremely slow and the worker record hasn't been created yet (NOTE: I haven't implemented this one yet). I think if we include the "starting timeout" in the check, based on the age of the pod, then we should be able to cover that case. wdyt?
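The "starting timeout" grace period suggested here can be sketched as a pure predicate. This is hypothetical: orphaned_pod?, the timeout value, and the keyword arguments are all illustrative, not the actual PR implementation.

```ruby
# Illustrative value; the real starting timeout comes from worker settings.
STARTING_TIMEOUT = 600 # seconds

# A pod with no miq_worker row is only treated as orphaned once it is older
# than the starting timeout, so slow-starting pods that have not yet created
# their row are not killed prematurely.
def orphaned_pod?(pod_created_at, row_exists, now: Time.now)
  return false if row_exists
  (now - pod_created_at) > STARTING_TIMEOUT
end
```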

@jrafanie (Member) commented Nov 6, 2023

For the systemd/process model I believe we're okay based on where in the monitor loop we are, by the time we are checking for orphan rows the worker records will have been created and the process or systemd service will have been started.

For kubernetes there could be an issue with the reverse where the pod is created but extremely slow and the worker record hasn't been created yet (NOTE I haven't implemented this one yet). I think if we include the "starting timeout' in the check based on the age of the pod then we should be able to cover that case, wdyt?

Yeah, that could work. We've seen it before where pods are really slow to start in our 🔥 environments, such as when CPU throttling is in play. Anything we do should account for possibly significant delays in both situations.

@miq-bot (Member) commented Feb 19, 2024

This pull request has been automatically marked as stale because it has not been updated for at least 3 months.

If these changes are still valid, please remove the stale label, make any changes requested by reviewers (if any), and ensure that this issue is being looked at by the assigned/reviewer(s).

@agrare agrare force-pushed the ensure_workers_and_worker_rows_match branch from 697d8b8 to bcd5b98 Compare March 18, 2024 16:33
@miq-bot (Member) commented Mar 18, 2024

Checked commit agrare@bcd5b98 with ruby 2.7.8, rubocop 1.56.3, haml-lint 0.51.0, and yamllint
5 files checked, 2 offenses detected:

app/models/miq_server/worker_management/kubernetes.rb

@miq-bot miq-bot added the stale label Jun 24, 2024
@miq-bot (Member) commented Jun 24, 2024

This pull request has been automatically marked as stale because it has not been updated for at least 3 months.

@miq-bot (Member) commented Aug 16, 2024

This pull request is not mergeable. Please rebase and repush.

@miq-bot (Member) commented Nov 18, 2024

This pull request has been automatically marked as stale because it has not been updated for at least 3 months.

5 participants