
Improve UX to increase likelihood of bids being received (or improve per node utilization) #287

Open
anilmurty opened this issue Aug 2, 2024 · 2 comments
Labels
P2 priority 2 feature/ enhancement

Comments

anilmurty commented Aug 2, 2024

Problem Statement

There are situations where Console users hit the GPU pricing page (on the website) or the providers page (in Console), see that there are enough "available" GPUs of the desired model, proceed to deploy via Console, only to NOT get ANY bids for their deployment. This can happen for the following primary reasons:

  • While there may be enough "available" GPUs in aggregate (across multiple providers), there may not be enough GPUs on a single provider.
  • While there may be enough GPUs on a single provider, there aren't enough (to fulfill the GPU count in the user's SDL) on a single node of the provider. This can happen if past small requests (1-2 GPUs per deployment) happened to get scheduled across different nodes of the provider, leaving the provider "fragmented" in terms of available GPUs.
  • While there may be enough GPUs on a single node to satisfy the GPU count, that specific node may not have enough other (non-GPU) resources available to satisfy all the resource requirements outlined in the compute profile. We have sometimes seen this happen when a provider's CPU count gets maxed out (~90%) with workloads while GPU usage remains low.

Solution(s)

The solution requires some deeper thought and brainstorming, but here are some initial thoughts and approaches. Note that the ideal solution is one that prevents the issue from occurring in the first place, but an improvement to the current experience would be one that informs the user about the (apparent) discrepancy and/or prevents them from requesting bids for workloads that are unlikely to get any.

  1. Resources Per Node: Provide per-node GPU counts, or at least the max available on any single node of the provider, in https://console.akash.network/providers -- this could be a column called "Max requestable per deployment" or something similar. Alternatively, it could be a filter on the table that lets the user specify the count they intend to request and shows only the providers that have >= that count available on any node (see the sketch after this list).

  2. Quick Check before initiating deployment: Implement a "Quick Check" button on the SDL builder page that the user can click, which runs a query to return whether any providers can meet the needs, while recommending which resource should be reduced to increase the number of bids received (also covered in the sketch after this list). Note that the reason for doing it here (rather than at the bids stage) is that the user can adjust the resources here, whereas doing that once the deployment is created requires closing the existing deployment and starting a new one.

  3. Better per node "bin packing": Improve the way pods get scheduled so that the nodes with the least amount of resources available to meet the request are prioritized over ones with more -- essentially a "bin packing" strategy of sorts. This would be much more involved -- TBD whether this can be achieved by customizing the behavior of the default kube-scheduler (https://kubernetes.io/docs/reference/config-api/kube-scheduler-config.v1/#kubescheduler-config-k8s-io-v1-KubeSchedulerConfiguration) or whether it would require implementing a custom scheduler. Another consideration (besides implementation complexity) is what this would do to the time it takes the scheduler to find a node (since it may have to go through the entire cluster). One thing to look into is the Scoring Strategy (https://kubernetes.io/docs/reference/config-api/kube-scheduler-config.v1/#kubescheduler-config-k8s-io-v1-ScoringStrategy); a config sketch follows below.
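As a rough illustration of approaches 1 and 2, here is a minimal TypeScript sketch. It assumes a hypothetical per-node inventory feed from each provider; the `NodeInventory` shape, field names, and `quickCheck`/`maxRequestableGpus` helpers are illustrative and not an existing Console API.

```typescript
// Hypothetical per-node availability reported by a provider (shape is illustrative).
interface NodeInventory {
  cpuMillicores: number; // available CPU (millicores)
  memoryBytes: number;   // available memory
  gpus: number;          // available GPUs of the requested model
}

interface ResourceRequest {
  cpuMillicores: number;
  memoryBytes: number;
  gpus: number;
}

// Approach 1: the most GPUs a single deployment could realistically request from
// this provider, i.e. the max available on any one node (not the aggregate).
function maxRequestableGpus(nodes: NodeInventory[]): number {
  return nodes.reduce((max, n) => Math.max(max, n.gpus), 0);
}

// Approach 2: "Quick Check" -- does any node on any provider fit the full profile?
// If not, suggest which resource is most often the blocker so the user can reduce it.
function quickCheck(
  providers: Map<string, NodeInventory[]>,
  req: ResourceRequest
): { matches: string[]; suggestion?: string } {
  const fits = (n: NodeInventory) =>
    n.cpuMillicores >= req.cpuMillicores &&
    n.memoryBytes >= req.memoryBytes &&
    n.gpus >= req.gpus;

  const matches = [...providers.entries()]
    .filter(([, nodes]) => nodes.some(fits))
    .map(([name]) => name);

  if (matches.length > 0) return { matches };

  // No node fits. If no node even has enough GPUs, suggest lowering the GPU count;
  // otherwise report whether CPU or memory is the more common blocker.
  const allNodes = [...providers.values()].flat();
  const gpuOk = allNodes.filter((n) => n.gpus >= req.gpus);
  if (gpuOk.length === 0) return { matches, suggestion: "gpu count" };
  const cpuBlocked = gpuOk.filter((n) => n.cpuMillicores < req.cpuMillicores).length;
  const memBlocked = gpuOk.filter((n) => n.memoryBytes < req.memoryBytes).length;
  return { matches, suggestion: cpuBlocked >= memBlocked ? "cpu" : "memory" };
}
```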

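For approach 3, a hedged sketch of what the kube-scheduler change might look like: the NodeResourcesFit plugin's scoring strategy can be switched from the default LeastAllocated to MostAllocated, which scores already-busy nodes higher and therefore packs workloads rather than spreading them. The weights and GPU resource name below are illustrative (they would need to match the provider cluster's device plugin, e.g. nvidia.com/gpu) and any change like this should be validated on a test cluster first.

```yaml
# Sketch of a KubeSchedulerConfiguration that favors bin packing.
# Weights are illustrative; the GPU resource name must match the cluster's device plugin.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated   # pack nodes instead of spreading (default is LeastAllocated)
            resources:
              - name: nvidia.com/gpu
                weight: 10        # weight GPU packing most heavily
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```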

anilmurty commented Aug 7, 2024

@devalpatel67 @andy108369 - we talked about this issue and approach #3 independently. It turns out Andrey suggested updating the k8s scheduler policy a while back as well (https://github.com/ovrclk/engineering/issues/320) - I'm wondering if we should try this on a test cluster and see how it goes. Specifically: we will be testing out a new provider (VP) for H100s soon (likely next week), so I was thinking that in addition to testing throughput and IO performance we could also test changing this and see if it optimizes per-node utilization better than our current k8s config does, without causing any other issues.

cc'ing @troian @boz and @chainzero for their thoughts as well. I think this is something that will cause our H100 providers to be underutilized if a bunch of users deploy SDLs with 1-2 H100s. And in the extreme case, we will have trouble getting bids for a Llama-3.1-405B type deployment (that needs 8x H100s) even if there are enough GPUs available on the provider as a whole but not enough on a single node.

@dominikusbrian

From the end-user perspective, the proposed solution No. 2, "Quick Check before initiating deployment," is a great one. It is exactly what I thought would be nice to have when faced with deployments that never seem to get proper bids from available providers. Most of the time asking for too many CPUs is the culprit; other times the storage request size or type is the issue. So having a suggestion on what in particular needs to be reduced to receive more bids would be nice.

Implementation-wise, maybe the feedback optimization could simply start with "more bids -> better"?
The recommendation system surely needs a way to know whether the suggestion is actually getting the user closer to accepting a bid from a provider and deploying successfully. Just to be sure, this query/quick check would be off-chain, so no transaction fees for every query cycle? Users would then only be charged once, when they actually initiate the real deployment.

Would love to see this coming to shape and eventually be a feature in Console 2.0.

@github-project-automation github-project-automation bot moved this to Backlog (not prioritized) in Core Product and Engineering Roadmap Aug 20, 2024
@baktun14 baktun14 added the P2 priority 2 feature/ enhancement label Sep 18, 2024