Hello Team,
Thank you for your outstanding work!
I have a question regarding SM allocation in MuxServe. In your paper, you mentioned:
... parallel runtime dynamically assigns SMs to each job at runtime rather than statically allocating ...
Does this imply that the SM configuration for each process is adjusted on demand while incoming requests are being served? If so, could you please explain how this is accomplished? Is it done by periodically issuing the CUDA MPS control command
set_active_thread_percentage <PID> <percentage>
for each process?
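To make the question concrete, here is a minimal sketch of the kind of periodic adjustment I have in mind. It is only my guess, not something taken from the MuxServe code. The helper names `build_mps_command` and `apply_sm_percentage` are hypothetical, and I am assuming the command is piped into the `nvidia-cuda-mps-control` utility. As far as I understand the MPS documentation, the PID refers to an MPS server and the new limit only applies to clients created afterwards, which is part of why I am unsure this is the mechanism you use.

```python
import subprocess


def build_mps_command(server_pid: int, percentage: float) -> str:
    """Build the MPS control command string quoted in the question.

    Hypothetical helper: it only formats the text, it does not talk to MPS.
    """
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in the range (0, 100]")
    return f"set_active_thread_percentage {server_pid} {percentage:g}"


def apply_sm_percentage(
    server_pid: int,
    percentage: float,
    control_binary: str = "nvidia-cuda-mps-control",
) -> subprocess.CompletedProcess:
    """Send the command to the MPS control daemon via its stdin.

    Assumption: an MPS control daemon is already running and the control
    binary is on PATH. Nothing here is taken from MuxServe itself.
    """
    command = build_mps_command(server_pid, percentage)
    return subprocess.run(
        [control_binary],
        input=command + "\n",
        text=True,
        capture_output=True,
        check=False,
    )
```

Calling `apply_sm_percentage(<PID>, 50)` from a scheduler loop is what I mean by adjusting the percentage "from time to time". If MuxServe instead fixes the percentage per process at launch and picks among pre-created processes, that would also answer my question.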