[WIP] Ensure workers and miq_worker rows match #22771
base: master
Conversation
Force-pushed from 4f32ebc to f368402
# Update worker deployments with updated settings such as cpu/memory limits
sync_deployment_settings
TODO: I think this has to come after cleanup_orphaned_worker_rows, and IMO it doesn't make sense to have sync_deployment_settings in sync_from_system, since that is supposed to just "find what is out there".
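A sketch of the ordering being suggested here; aside from the method names visible in this PR's diff, the wrapper and its name are assumed:

```ruby
# Illustrative ordering only, not the actual monitor loop.
def monitor_loop_pass
  sync_from_system             # just "find what is out there" (processes/services/pods)
  cleanup_orphaned_worker_rows # then drop miq_worker rows with no matching worker
  sync_deployment_settings     # and only then push updated cpu/memory limits
end
```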
Force-pushed from f368402 to 0d79355
Force-pushed from 2d55105 to e6e1c39
@@ -50,6 +46,9 @@ def cleanup_orphaned_worker_rows
    end
  end

  def cleanup_orphaned_workers
  end
Can you add a comment in here just to describe why this is empty?
Same comment for the other "empty" methods.
It just isn't implemented yet; I'm testing with Process first.
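Something like the following is presumably what the reviewer is after; the comment wording is a guess based on this thread:

```ruby
# Intentionally a no-op for now: orphan cleanup is being implemented one
# worker model at a time, starting with Process, so this variant is stubbed
# out until it is filled in. (Wording assumed from the review discussion.)
def cleanup_orphaned_workers
end
```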
  end

  def sync_starting_workers
    MiqWorker.find_all_starting.to_a
  end

  def cleanup_orphaned_worker_rows
    orphaned_rows = miq_workers.where.not(:pid => miq_pids)
We use the system_uid column in kubernetes - is that column populated with the pid in process mode? If so, it may be worth using the same column for consistency.
It isn't set for process or systemd, but we should do that for consistency. That's going to be another PR, but we can probably get that in first and make use of it here.
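Once system_uid is populated for process/systemd workers (the follow-up PR mentioned above), the cleanup could key off that column instead of pid; a minimal sketch, with system_uids assumed to be a helper listing the uids of the workers found on the system:

```ruby
# Illustrative sketch only: use system_uid instead of pid so the same query
# works across process, systemd, and kubernetes. `system_uids` is an assumed
# helper returning the uids of the currently running workers.
def cleanup_orphaned_worker_rows
  orphaned_rows = miq_workers.where.not(:system_uid => system_uids)
  orphaned_rows.destroy_all unless orphaned_rows.empty?
end
```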
The first 10 failures are from the awesome_spawn update. Think this one is reasonable:
@kbrock yeah, this isn't close to done yet; I expect the tests to fail right now.
Reran after merge of #22772.
Force-pushed from b11a008 to 697d8b8
@agrare I don't remember the details why, but I recall worker rows are created by the server in spawn/systemd but in the pod itself when it invokes run_single_worker.rb. In other words, for a moment we have a row without a process in the spawn/systemd case and a pod without a row in the podified case. I don't know if this change, or the change I tested earlier this week, would cause a problem with this existing assumption. In my case, I was attempting to reload the worker row before each heartbeat and let the process exit if the row was removed.

diff --git a/app/models/miq_worker/runner.rb b/app/models/miq_worker/runner.rb
index 6b7ed69f3d..c8dd3ffd9c 100644
--- a/app/models/miq_worker/runner.rb
+++ b/app/models/miq_worker/runner.rb
@@ -319,6 +319,7 @@ class MiqWorker::Runner
       # Heartbeats can be expensive, so do them only when needed
       return if @last_hb.kind_of?(Time) && (@last_hb + worker_settings[:heartbeat_freq]) >= now
+      reload_worker_record
       systemd_worker? ? @worker.sd_notify_watchdog : heartbeat_to_file
       if config_out_of_date?
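The diff above calls reload_worker_record without showing its body; a minimal sketch of the reload-and-exit behavior described, with the implementation assumed:

```ruby
# Illustrative sketch only: reload the worker row before heartbeating and
# exit if it has been deleted. `do_exit` is assumed to be the runner's
# existing shutdown helper.
def reload_worker_record
  @worker.reload
rescue ActiveRecord::RecordNotFound
  do_exit("Worker record #{@worker.guid} no longer exists, exiting", 1)
end
```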
@jrafanie that is a great point and something we need to be careful about. Kubernetes creates the worker rows in run_single_worker because we can't know the GUID prior to starting the pod (all pods in a replica set have to have the same environment), and kubernetes was already deleting all worker rows that didn't match any running pods.

For the systemd/process model I believe we're okay based on where in the monitor loop we are: by the time we are checking for orphan rows, the worker records will have been created and the process or systemd service will have been started.

For kubernetes there could be an issue with the reverse, where the pod is created but extremely slow and the worker record hasn't been created yet (NOTE: I haven't implemented this one yet). I think if we include the "starting timeout" in the check based on the age of the pod then we should be able to cover that case, wdyt?
Yeah, that could work. We've seen it before where pods are really slow to start in our 🔥 environments, such as when CPU throttling is in play. Anything we do should account for possibly significant delays in both situations.
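For the kubernetes side, a rough sketch of the "starting timeout" idea discussed above; current_pods, the pod hash keys, and delete_orphaned_pod are illustrative assumptions, not the actual implementation:

```ruby
# Illustrative sketch only: skip pods younger than the starting timeout when
# looking for pods without a miq_worker row, since a slow-starting pod may
# simply not have created its row yet. All helpers here are assumptions.
def cleanup_orphaned_worker_pods(starting_timeout = 10.minutes)
  current_pods.each_value do |pod|
    next if miq_workers.exists?(:system_uid => pod[:name])   # row exists, not orphaned
    next if pod[:creation_timestamp] > starting_timeout.ago  # still within the starting window

    delete_orphaned_pod(pod) # assumed cleanup helper
  end
end
```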
This pull request has been automatically marked as stale because it has not been updated for at least 3 months. If these changes are still valid, please remove the stale label.
Force-pushed from 697d8b8 to bcd5b98
Checked commit agrare@bcd5b98 with ruby 2.7.8, rubocop 1.56.3, haml-lint 0.51.0, and yamllint
app/models/miq_server/worker_management/kubernetes.rb
This pull request has been automatically marked as stale because it has not been updated for at least 3 months. If these changes are still valid, please remove the stale label.
This pull request is not mergeable. Please rebase and repush.
This pull request has been automatically marked as stale because it has not been updated for at least 3 months. If these changes are still valid, please remove the stale label.
It is possible for the workers and the miq_worker rows to get out of sync.
#22644
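Taken together, the reconciliation the PR is working toward looks roughly like the sketch below; only sync_from_system, cleanup_orphaned_worker_rows, cleanup_orphaned_workers, and sync_starting_workers appear in the diffs above, and the wrapper name is made up for illustration:

```ruby
# Illustrative summary only: reconcile in both directions so the running
# workers and the miq_worker rows match. The wrapper name is hypothetical.
def ensure_workers_and_rows_match
  sync_from_system             # discover what is actually running (processes/services/pods)
  cleanup_orphaned_worker_rows # row with no matching worker -> delete the row
  cleanup_orphaned_workers     # worker with no matching row -> stop the worker
  sync_starting_workers        # re-check workers still reported as "starting"
end
```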