Describe the bug
The funcX manager appears to be continually trying to spin up more workers than the worker map allows. I know this because I'm using the "pin to accelerator" functionality, which raises an error if you attempt to allocate more workers than there are available accelerators.
To Reproduce
TBD: I don't have a minimal reproducer for this yet.
Expected behavior
The manager should not write an error message every few seconds.
Environment
OS: Ubuntu
Python version @ client: 3.8.12
Python version @ endpoint: 3.8.12
funcx version @ client: 0.3.9
funcx-endpoint version @ endpoint: 0.3.9.dev0
Distributed Environment
Where are you running the funcX script from? Laptop
Where does the endpoint run? Workstation
What is your endpoint-uuid? acdb2f41-fd86-4bc7-a1e5-e19c12d3350d
1660838271.649775 2022-08-18 10:57:51 ERROR MainProcess-2266833 MainThread-140719713703616 funcx_endpoint.executors.high_throughput.worker_map:181 spin_up_workers Error spinning up worker! Skipping...
Traceback (most recent call last):
File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/site-packages/funcx_endpoint/executors/high_throughput/worker_map.py", line 339, in add_worker
device = self.available_accelerators.get_nowait()
File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/queue.py", line 198, in get_nowait
return self.get(block=False)
File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/queue.py", line 167, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/site-packages/funcx_endpoint/executors/high_throughput/worker_map.py", line 169, in spin_up_workers
proc = self.add_worker(
File "/lus/theta-fs0/projects/CSC249ADCD08/edw/env-gpu/lib/python3.8/site-packages/funcx_endpoint/executors/high_throughput/worker_map.py", line 341, in add_worker
raise ValueError(
ValueError: No accelerators are available. New worker must be created only after another is removed
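For context, here is a minimal sketch of the pattern the traceback points at (names like `available_accelerators` and `add_worker` are taken from the traceback; the device names and pool size are hypothetical, not funcX's actual code): each worker takes one device from a fixed-size queue, and once the queue is drained, `get_nowait()` raises `queue.Empty`, which is converted into the `ValueError` above. If the manager keeps calling `add_worker` without waiting for a device to be returned, this error repeats indefinitely.

```python
from queue import Queue, Empty

# Fixed pool of accelerators; assume a hypothetical 2-GPU node.
available_accelerators = Queue()
for device in ["cuda:0", "cuda:1"]:
    available_accelerators.put(device)

def add_worker():
    """Claim one accelerator for a new worker, or fail if none remain."""
    try:
        # get_nowait() raises queue.Empty immediately when the pool is drained
        device = available_accelerators.get_nowait()
    except Empty:
        raise ValueError(
            "No accelerators are available. New worker must be created "
            "only after another is removed"
        )
    return device

add_worker()  # claims "cuda:0"
add_worker()  # claims "cuda:1"
try:
    add_worker()  # pool exhausted: raises ValueError
except ValueError as err:
    print("third worker rejected:", err)
```

Under this reading, the bug would be in whatever decides to call `add_worker` again while the pool is empty, rather than in the pool accounting itself.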
I left out something important when creating this issue: the manager fails to spin up new workers until all other workers have exited. That's almost certainly related.
manager.log